Abstract

Principal component analysis (PCA) is one of the most commonly used statistical procedures
with a wide range of applications. This paper considers both minimax and adaptive estimation
of the principal subspace in the high dimensional setting. Under mild technical conditions, we
first establish the optimal rates of convergence for estimating the principal subspace which are
sharp with respect to all the parameters, thus providing a complete characterization of the
difficulty of the estimation problem in terms of the convergence rate. The lower bound is obtained
by calculating the local metric entropy and an application of Fano's Lemma. The rate-optimal
estimator is constructed using aggregation, which, however, might not be computationally
feasible.

We then introduce an adaptive procedure for estimating the principal subspace which is 
fully data driven and can be computed efficiently. It is shown that the estimator attains the 
optimal rates of convergence simultaneously over a large collection of the parameter spaces. A 
key idea in our construction is a reduction scheme which reduces the sparse PCA problem to 
a high-dimensional multivariate regression problem. This method is potentially also useful for 
other related problems. 
1 Introduction

Due to dramatic advances in science and technology, high-dimensional data are now routinely 
collected in a wide range of fields including genomics, signal processing, risk management, and 
portfolio allocation. In many applications, the signal of interest lies in a subspace of much lower 
dimension and the between-sample variation is determined by a small number of factors. For 
example, in spectroscopy, the variation of the infrared and ultraviolet spectra is driven by the
concentration levels of a small number of chemical components in the system [50]. In financial
econometrics, it is commonly believed that the variation in asset returns is driven by a small
number of common factors combined with random noise [3].

Principal component analysis (PCA) is one of the most commonly used techniques in multivariate
analysis for dimension reduction and feature extraction, and is particularly well suited for
settings where the data are high-dimensional but the signal has a low-dimensional structure. PCA
has a wide array of applications, ranging from image recognition to data compression to clustering.
In the conventional setting where the dimension of the data is relatively small compared with the
sample size, the principal eigenvectors of the covariance matrix are typically estimated by the leading
eigenvectors of the sample covariance matrix, which are consistent when the dimension $p$ is fixed
and the sample size $n$ increases [?]. However, in the high-dimensional setting where $p$ can be much
larger than $n$, this approach leads to very poor estimates. At various levels of rigor and generality, a
series of papers [21], [4], [41], [38], [25], [28], [?] showed that the sample principal eigenvectors are no longer
consistent estimates of their population counterparts. For example, Baik and Silverstein [4] and
Paul [41] showed that if $p/n \to \gamma \in (0, 1)$ as $n \to \infty$, and the largest eigenvalue $\lambda_1 \le \sqrt{\gamma}$ and is of
unit multiplicity, then the leading sample principal eigenvector $\hat{v}_1$ is asymptotically almost surely
orthogonal to the leading population eigenvector $v_1$, i.e., $|\hat{v}_1' v_1| \to 0$ almost surely. Thus, in this
case, $\hat{v}_1$ is not useful at all as an estimate of $v_1$. Even when $\lambda_1 > \sqrt{\gamma}$, the angle between $\hat{v}_1$ and $v_1$
still does not converge to zero unless $\lambda_1 \to \infty$. In addition to being inconsistent, sample principal
eigenvectors have nonzero loadings in all the coordinates. This renders their interpretation difficult
when the dimension $p$ is large.
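
The inconsistency described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: a rank-one spiked model with unit noise variance, a population eigenvector fixed at the first coordinate axis, and toy sizes chosen for illustration. The helper name `leading_overlap` is hypothetical and not part of the paper.

```python
import numpy as np

def leading_overlap(n, p, lam, seed):
    """|<v_hat_1, v_1>| for one draw from a rank-one spiked model with unit noise."""
    rng = np.random.default_rng(seed)
    v = np.zeros(p)
    v[0] = 1.0                                   # population eigenvector (taken as e_1)
    # Rows are N(0, lam * v v' + I_p): signal along v plus white noise.
    X = np.sqrt(lam) * rng.standard_normal((n, 1)) * v + rng.standard_normal((n, p))
    # Leading eigenvector of X'X/n is the leading right singular vector of X.
    v_hat = np.linalg.svd(X, full_matrices=False)[2][0]
    return abs(v_hat @ v)

# High-dimensional regime: p/n = 0.5 and lam below sqrt(0.5), so the overlap is small.
overlap_high_dim = leading_overlap(n=1000, p=500, lam=0.3, seed=0)
# Classical regime: p fixed and small relative to n, so the overlap is close to one.
overlap_low_dim = leading_overlap(n=4000, p=5, lam=0.3, seed=0)
```

With the same spike strength, the sample eigenvector is nearly orthogonal to the truth in the first setting and nearly aligned with it in the second.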

1.1 Sparse PCA 

In view of the above negative results in the high-dimensional setting, a natural approach to principal
component analysis in high dimensions is to impose certain structural constraints on the leading
eigenvectors. One of the most popular assumptions is that the leading eigenvectors have a certain
type of sparsity. In this case, the problem is commonly referred to as sparse PCA in the literature. 
The sparsity constraint reduces the effective number of parameters and facilitates interpretation. 

Various regularized estimators of the leading eigenvectors have been proposed in the literature.
See, for example, [26], [?], [?], [?], [49], [53], [27]. Theoretical analysis has so far mainly focused on the
rank-one case, i.e., estimating the principal eigenvector $v_1$. In this case, Johnstone and Lu [25]
showed that classical PCA performed on a selected subset of variables with the largest sample
variances leads to a consistent estimator of $v_1$ if the ordered coefficients of $v_1$ have rapid decay. Shen
et al. [45] and Yuan and Zhang [55] proposed other consistent estimators when $v_1$ has a bounded
number of nonzero coefficients. Vu and Lei [51] studied the rates of convergence of estimation
under various sparsity assumptions on $v_1$, and Lounici [34] further considered the minimax rates
with missing data. Amini and Wainwright [3] investigated the variable selection property of the
methods by [25] and [3] when $v_1$ has $k$ nonzero entries all of the same magnitude. Berthet and
Rigollet [?] considered minimax detection when $v_1$ has a bounded number of non-zeros.

More recently, for estimating a fixed number $r > 1$ of leading eigenvectors as $n, p \to \infty$,
Birnbaum et al. [9] studied minimax rates of convergence and adaptive estimation of the individual
leading eigenvectors when the ordered coefficients of each eigenvector have rapid decay. When
$r > 1$ and some of the leading eigenvalues have multiplicity greater than one, the individual leading
eigenvectors can be unidentifiable. On the other hand, the principal subspace spanned by them
is always uniquely defined. Ma [36] proposed a new method for estimating the principal subspace
and derived rates of convergence of the estimator under similar conditions to those in [?].

1.2 Estimation of Principal Subspace 

In this paper, we focus on the estimation of the principal subspace. Both minimax and adaptive 
estimation are considered. Throughout the paper, let $X$ be an $n \times p$ data matrix generated as

$$X = UDV' + Z. \tag{1}$$

Here $U$ is the $n \times r$ random effects matrix with iid $N(0, 1)$ entries, $D = \mathrm{diag}(\lambda_1^{1/2}, \dots, \lambda_r^{1/2})$ with
$\lambda_1 \ge \cdots \ge \lambda_r > 0$, $V$ is $p \times r$ orthonormal, and $Z$ has iid $N(0, \sigma^2)$ entries which are independent
of $U$. Equivalently, one can think of $X$ as an $n \times p$ matrix with rows independently drawn from
the distribution $N(0, \Sigma)$, where the covariance matrix $\Sigma$ is given by

$$\Sigma = \mathrm{Cov}(X_{i*}) = V \Lambda V' + \sigma^2 I_p. \tag{2}$$

Here $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_r)$ and $V = [v_1, \dots, v_r]$ is $p \times r$ with orthonormal columns. The $r$ largest
eigenvalues of $\Sigma$ are $\lambda_i + \sigma^2$, $i = 1, \dots, r$, and the rest are all equal to $\sigma^2$. The $r$ leading eigenvectors
of $\Sigma$ are given by the columns of $V$. Since the spectrum of $\Sigma$ has $r$ spikes, the covariance structure
(2) is commonly known as the spiked covariance matrix model [22] in the literature.
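
As a minimal sketch of the data-generating model (1) and the covariance structure (2), the following builds the population covariance and checks its spectrum numerically. The function names `spiked_covariance` and `sample_spiked_data`, as well as the toy dimensions, are assumptions made for illustration only.

```python
import numpy as np

def spiked_covariance(V, lam, sigma=1.0):
    """Population covariance Sigma = V diag(lam) V' + sigma^2 I_p, as in (2)."""
    p = V.shape[0]
    return V @ np.diag(lam) @ V.T + sigma**2 * np.eye(p)

def sample_spiked_data(n, V, lam, sigma=1.0, rng=None):
    """Draw X = U D V' + Z as in (1), with D = diag(sqrt(lam))."""
    rng = np.random.default_rng(rng)
    p, r = V.shape
    U = rng.standard_normal((n, r))            # random effects, iid N(0, 1)
    Z = sigma * rng.standard_normal((n, p))    # noise, iid N(0, sigma^2)
    return U @ np.diag(np.sqrt(lam)) @ V.T + Z

# Hypothetical toy sizes, chosen only for illustration.
p, r = 6, 2
V, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((p, r)))  # orthonormal columns
lam = np.array([4.0, 2.0])
Sigma = spiked_covariance(V, lam, sigma=1.0)
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # r spikes, then p - r copies of sigma^2
```

The sorted eigenvalues consist of the $r$ spikes $\lambda_i + \sigma^2$ followed by $p - r$ copies of $\sigma^2$, which is the structure that gives the model its name.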

The goal of the present paper is to estimate the principal subspace $\mathrm{span}(V)$ based on the
observation $X$. Note that the principal subspace is uniquely identified with the associated projection
matrix $VV'$. In addition, any estimator can be regarded as the subspace spanned by the columns
of a matrix $\hat{V}$ with orthonormal columns, hence uniquely identified with its projection matrix $\hat{V}\hat{V}'$.
Thus, estimating $\mathrm{span}(V)$ is equivalent to estimating $VV'$. In this paper we consider optimal and
adaptive estimation of $\mathrm{span}(V)$ under the loss function

$$L(V, \hat{V}) = \|VV' - \hat{V}\hat{V}'\|_F^2, \tag{3}$$

which is a commonly used metric to gauge the distance between linear subspaces. It also coincides
with twice the sum of the squared sines of the principal angles between the respective linear spans.
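
The relation between the loss (3) and the principal angles can be checked numerically. The sketch below assumes both arguments have orthonormal columns; the function names are hypothetical.

```python
import numpy as np

def subspace_loss(V, V_hat):
    """Loss (3): squared Frobenius distance between the two projection matrices."""
    return np.linalg.norm(V @ V.T - V_hat @ V_hat.T, "fro") ** 2

def sum_sq_sines(V, V_hat):
    """Sum of squared sines of the principal angles between span(V) and span(V_hat)."""
    # Singular values of V' V_hat are the cosines of the principal angles.
    cosines = np.linalg.svd(V.T @ V_hat, compute_uv=False)
    return np.sum(1.0 - np.clip(cosines, 0.0, 1.0) ** 2)

rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((8, 3)))
V_hat, _ = np.linalg.qr(rng.standard_normal((8, 3)))
loss = subspace_loss(V, V_hat)

# The loss depends only on the spanned subspace: rotating the basis leaves it unchanged.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
loss_rotated_basis = subspace_loss(V, V @ Q)
```

Because the loss is a function of the projection matrices alone, it does not require the individual eigenvectors to be identifiable.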
The difficulty of estimating $\mathrm{span}(V)$ depends on the joint sparsity of the columns of $V$. Let
$\|V_{j*}\|$ denote the Euclidean norm of the $j$th row of $V$. Order the row norms in decreasing order as
$\|V_{(1)*}\| \ge \cdots \ge \|V_{(p)*}\|$. We define the weak $\ell_q$ semi-norm of $V$ as

$$\|V\|_{q,w} = \max_{1 \le j \le p} \, j \, \|V_{(j)*}\|^q \tag{4}$$

and let

$$O(p, r) = \{ V \in \mathbb{R}^{p \times r} : V'V = I_r \} \tag{5}$$

denote the collection of $p \times r$ matrices with orthonormal columns. We consider the following
parameter spaces for $\Sigma$, where the weak $\ell_q$ semi-norm of $V$ is constrained:

$$\Theta_q(s, p, r, \lambda) = \{ \Sigma = V \Lambda V' + I_p : \lambda \le \lambda_r \le \cdots \le \lambda_1 \le \kappa \lambda, \; V \in O(p, r), \; \|V\|_{q,w} \le s \}, \tag{6}$$

where $q \in [0, 2)$ and $\kappa > 1$ is a fixed constant. In the special case of $q = 0$, the union of the column
supports of $V$ is of size at most $s$. The weak-$\ell_q$ ball is a commonly used constraint to model sparsity.
See, e.g., Abramovich et al. [1] for wavelet estimation and Cai and Zhou [11] for sparse covariance
matrix estimation. Group sparsity is also useful for high-dimensional regression, see, for example,
Lounici et al. [35].
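
A short sketch of how the weak $\ell_q$ semi-norm in (4) can be computed from the ordered row norms. The function name `weak_lq_seminorm` is hypothetical, and the treatment of $q = 0$ (counting nonzero rows, i.e. the convention $0^0 = 0$) is an assumption chosen to match the exact sparse case described above.

```python
import numpy as np

def weak_lq_seminorm(V, q):
    """Weak l_q semi-norm (4): max_j j * ||V_(j)*||^q over decreasingly ordered row norms."""
    row_norms = np.sort(np.linalg.norm(V, axis=1))[::-1]   # decreasing order
    if q == 0:
        # Convention 0^0 = 0: the value is the number of nonzero rows.
        return float(np.count_nonzero(row_norms))
    ranks = np.arange(1, len(row_norms) + 1)
    return float(np.max(ranks * row_norms ** q))

# Toy matrix with three nonzero rows of norms 1, 0.5 and 0.25 (illustration only).
V_example = np.zeros((5, 3))
V_example[0, 0], V_example[1, 1], V_example[2, 2] = 1.0, 0.5, 0.25
row_sparsity = weak_lq_seminorm(V_example, 0)
weak_l1 = weak_lq_seminorm(V_example, 1.0)
```

For $q > 0$ the constraint $\|V\|_{q,w} \le s$ forces the $j$th largest row norm to decay at least like $(s/j)^{1/q}$, without requiring any row to vanish exactly.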



1.3 Optimal Rates of Convergence 

Combining the upper and lower bound results developed in Section 2, we establish the following
minimax rates of convergence for estimating the principal subspace $\mathrm{span}(V)$ under the loss (3). We
focus here on the exact sparse case of $q = 0$; the optimal rates for the general case of $q \in (0, 2)$
are given in Section 2. For two sequences of positive numbers $a_n$ and $b_n$, we write $a_n \asymp b_n$ if there
exist two constants $0 < c \le C < \infty$ not depending on $n$ such that $c \le a_n / b_n \le C$.

Theorem 1. Suppose we observe the data matrix $X$ as in (1). Let $r \le p - s + 1$ and
$n \ge C \left( s \log \frac{ep}{s} \vee \log \frac{\lambda}{\sigma^2} \right)$ for some sufficiently large constant $C$. The minimax risk for estimating the
principal subspace $\mathrm{span}(V)$ under the loss (3) satisfies

$$\inf_{\hat{V}} \sup_{\Sigma \in \Theta_0(s, p, r, \lambda)} \mathbb{E} \|\hat{V}\hat{V}' - VV'\|_F^2 \asymp \frac{\lambda/\sigma^2 + 1}{n (\lambda/\sigma^2)^2} \left( r(s - r) + s \log \frac{ep}{s} \right) \tag{7}$$

as long as the right-hand side of (7) does not exceed some absolute constant. Otherwise, consistent
estimators do not exist.

It is interesting to note that the optimal rate (7) depends on the rank $r$ quadratically through
$r(s - r)$, which is the dimension of the Grassmannian manifold $G(s, r)$. Therefore the dependence on $r$ is not monotonic, with
the worst case occurring at $r = s/2$. The rate of convergence in (7) has optimal dependence on all
the parameters $s$, $p$, $r$, $n$ and $\lambda$. The results thus provide a complete and precise characterization of
the difficulty of the principal subspace estimation problem in terms of the minimax rate.
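
The non-monotone dependence on the rank can be seen by evaluating the rate numerically. The sketch below assumes that the right-hand side of (7) has the form $\frac{\lambda/\sigma^2 + 1}{n(\lambda/\sigma^2)^2}\left(r(s - r) + s \log\frac{ep}{s}\right)$, with constants ignored; the function name and the parameter values are hypothetical.

```python
import math

def minimax_rate_q0(s, p, r, n, lam, sigma=1.0):
    """Assumed form of the rate in (7), up to constants, for the exact sparse case q = 0."""
    snr = lam / sigma**2
    return (snr + 1.0) / (n * snr**2) * (r * (s - r) + s * math.log(math.e * p / s))

# Evaluate the rate over all admissible ranks for one toy configuration.
rates = {r: minimax_rate_q0(s=10, p=100, r=r, n=500, lam=2.0) for r in range(1, 11)}
worst_rank = max(rates, key=rates.get)
```

Under this assumed form, the rate is symmetric under $r \mapsto s - r$ and is largest at $r = s/2$, since $r$ enters only through $r(s - r)$.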
A key step in establishing the optimal rates of convergence is the derivation of rate-sharp
minimax lower bounds. It is highly non-trivial to obtain a lower bound which depends optimally
on all parameters, in particular the singular values and the rank. Our main technical tool for the
lower bounds is the local metric entropy [32], [?], [54], which is different from the usual methods based
on explicit constructions of packing sets together with Fano's Lemma used, for example, in [?].
Although the method is abstract in nature, the advantage is that it only relies on the analytical
behavior of the metric entropy of the parameter space, thus allowing us to sidestep constructing
an explicit packing, which is a challenging task due to the need to fulfil both the orthogonality
and the weak-$\ell_q$ ball constraints.

We then construct an explicit estimator using an aggregation scheme, which is shown to attain
the same rates of convergence as those of the minimax lower bounds. The matching lower and
upper bounds together establish the optimal rates of convergence. This aggregation method can
potentially be useful for other high-dimensional sparse PCA problems as well. Aggregation methods
have been widely used and studied in statistics and machine learning. See, for example, Nemirovski
[40] and Rigollet and Tsybakov [?]. To the best of our knowledge, this is the first application of
the aggregation approach to sparse PCA which yields optimality results.



1.4 Adaptive Estimation 

The rate-optimal aggregation estimator depends on the unknown parameters and is unfortunately 
not computationally feasible when p is large. We then propose an adaptive estimation procedure 
that is fully data driven and easily implementable. The estimator is shown to attain the optimal 
rate of convergence simultaneously over a large collection of the parameter spaces defined in (6).

The proposed method is based on a reduction scheme, where the original sparse PCA problem 
is reduced to a high-dimensional regression problem with orthogonal design and group sparsity on 
the regression coefficients. Then, we apply the model selection penalty idea from [8] to construct 
the final estimator. 

A key step in the reduction scheme is the construction of two new samples in the form of (1),
which share the same realization of the random effects $U$ but have independent copies of the noise
matrices. This construction works because a common realization of $U$ is critical in maintaining
the right level of signal-to-noise ratio in the regression problem. In contrast, splitting the original
sample into two halves fails to achieve this goal. On the other hand, the independence of the noise
components ensures that the regression problem has white noise structure. The adaptivity and
minimax optimality of the subspace estimator depend heavily on those of the regression coefficient
estimator. Thus, as a byproduct of the analysis, we also show that our estimator for regression
coefficients is adaptively rate optimal under group sparsity. To the best of our knowledge, the
specific estimator and its adaptive optimality are also new in the literature.
1.5 Other Related Work

The present paper is also related to a fast growing literature on estimating sparse covariance/precision
matrices as well as low-rank matrices. Significant advances have been made on optimal estimation
of the whole covariance or precision matrix. Many regularization methods, including banding,
tapering, thresholding and penalization, have been proposed. In particular, Cai et al. [12] established
the optimal rate of convergence for estimating a class of bandable covariance matrices under the
spectral norm. Cai and Yuan [10] proposed a block thresholding procedure which is shown to
adaptively achieve the optimal rate over a wide range of collections of bandable covariance
matrices. Bickel and Levina [?] introduced a thresholding procedure and obtained rates of convergence
for sparse covariance matrix estimation. Cai and Zhou [11] established the minimax rates of
convergence for estimating sparse covariance matrices under a range of matrix norms including the
spectral norm. Cai et al. [13] obtained the optimal rate of convergence for estimating the sparse
precision matrices.

Our work is also related to another active area of research, namely the recovery of low-rank
matrices based on noisy observations. Negahban and Wainwright [39] studied (near) low-rank
matrix recovery by M-estimators under restricted strong convexity based on the penalized nuclear
norm minimization over matrices. Koltchinskii et al. [30] considered estimation of low-rank matrices
based on a trace regression model which includes matrix completion as a special case. A nuclear
norm penalized estimator was proposed and a general sharp oracle inequality was established. See
also Recht et al. [42] and Rohde and Tsybakov [43].



1.6 Organization of the Paper 

The rest of the paper is organized as follows. After introducing basic notation, Section 2 establishes
the minimax rates of convergence for estimating the principal subspace by obtaining matching
minimax lower and upper bounds. An aggregation estimator is constructed and shown to be rate
optimal. Section 3 introduces an adaptive estimation procedure for the principal subspace which is
fully data driven and easily computable. It is shown that this estimator attains the optimal rates
of convergence simultaneously over a large collection of parameter spaces. Connections to other
related problems are discussed in Section 4. The proofs of the main results and key technical lemmas
are given in Section 5 and some additional technical arguments are contained in the appendix.



2 Minimax Rates for Principal Subspace Estimation 

We establish in this section the minimax rates of convergence for estimating the principal subspace 
in two steps. First, minimax lower bounds are obtained for the estimation problem under the loss 
(3). Then an aggregation estimator is introduced and is shown to attain the same rates as given in
the lower bounds, under mild conditions on the parameters. The matching lower and upper bounds
thus establish the minimax rates of convergence.

We begin by introducing some basic notation. Throughout the paper, for any matrix $X = (x_{ij})$
and any vector $u$, denote by $\|X\|$ the spectral norm, $\|X\|_F$ the Frobenius norm, and $\|u\|$ the
vector $\ell_2$ norm. Moreover, the $i$th row of $X$ is denoted by $X_{i*}$ and the $j$th column by $X_{*j}$. Let
$\mathrm{supp}(X) = \{i : X_{i*} \ne 0\}$ denote the row support of $X$. For a positive integer $p$, $[p]$ denotes the
index set $\{1, 2, \dots, p\}$. For two subsets $I$ and $J$ of indices, denote by $X_{IJ}$ the $|I| \times |J|$ submatrix
formed by $x_{ij}$ with $(i, j) \in I \times J$. Let $X_{I*} = X_{I[p]}$ and $X_{*J} = X_{[n]J}$. For any square matrix
$A = (a_{ij})$, we let $\mathrm{Tr}(A) = \sum_i a_{ii}$ be its trace. Define the inner product of any two matrices $B$ and
$C$ of the same size by $\langle B, C \rangle = \mathrm{Tr}(B'C)$. For any matrix $A$, we use $\sigma_i(A)$ to denote its $i$th largest
singular value. When $A$ is positive semi-definite, $\sigma_i(A)$ is also the $i$th largest eigenvalue of $A$. For
any real numbers $a$ and $b$, set $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. For any set $A$, $|A|$ denotes its
cardinality. Let $S^{p-1}$ denote the unit sphere in $\mathbb{R}^p$. Let $G(k, r)$ denote the Grassmannian manifold
consisting of all $r$-dimensional linear subspaces of $\mathbb{R}^k$. Let $O(p)$ denote the collection of all $p \times p$
orthogonal matrices. Throughout the paper, we use $C$ to denote a generic positive constant, though
the actual value may vary at different occasions.

Let $q \in [0, 2)$ and $s > 0$. Denote the weak-$\ell_q$ ball on $O(p, r)$ by

$$\mathcal{G}_q(s, p) = \{ V \in O(p, r) : \|V\|_{q,w} \le s \}, \tag{8}$$

which is the parameter space of $V$. In order for $\mathcal{G}_q(s, p)$ to be non-trivial, i.e., neither empty nor
the whole $O(p, r)$, the weak-$\ell_q$ radius must satisfy (see the appendix for a proof)

$$\frac{2 - q}{2} \, r \le s \le p. \tag{9}$$

In particular, if $q = 0$, then we have $1 \le r \le s \le p$. Throughout the paper, we assume that (9)
holds.



2.1 Lower Bounds 

We first establish the minimax lower bounds which are instrumental in obtaining the optimal rates
of convergence. In view of the upper bounds given in Section 2.2 by an aggregation procedure,
these lower bounds are in fact minimax rate optimal.

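Before proceeding to the precise statements, we introduce the following notation: Let

$$h(\lambda) = \frac{\lambda^2}{\lambda + 1}, \tag{10}$$

$$\Psi(k, p, r, n, \lambda) = \frac{1}{n h(\lambda)} \left( rk + k \log \frac{ep}{k} \right), \tag{11}$$

and

$$\Psi_0(k, p, r, n, \lambda) = \frac{1}{n h(\lambda)} \left( r(k - r) + k \log \frac{ep}{k} \right). \tag{12}$$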

Define the effective dimension by

$$k_q^*(s, p, r, \lambda, n) = \lfloor x_q(s, p, r, \lambda, n) \rfloor \wedge p, \tag{13}$$

where for any number $a$, $\lfloor a \rfloor$ is the largest integer not exceeding $a$, and $x_q(s, p, r, \lambda, n)$ is the solution
to the following equation

$$x = s \left( \frac{n h(\lambda)}{r + \log \frac{ep}{x}} \right)^{q/2}. \tag{14}$$

Remark 1 (Effective dimension). The effective dimension $k_q^*$ is a proxy to capture the massiveness
of the parameter set for the principal subspace under the weak-$\ell_q$ constraint. Moreover, the minimax
estimation rate turns out to be a strictly increasing function of $k_q^*$. From (13) it is evident that
$k_0^* = s$. Therefore in the exact sparse case, the effective dimension coincides with the row sparsity
of $V$. Moreover, for any $q \in (0, 2)$, the equation (14) always has a positive solution. Under the
assumption (16), it can be shown that the solution satisfies $x_q(s, p, r, \lambda, n) \ge s$. Consequently,
$k_q^*(s, p, r, \lambda, n) \ge s$.

Without loss of generality, we shall assume unit noise standard deviation ($\sigma = 1$) from now on.
All results hold for a general $\sigma$ by replacing $\lambda$ with $\lambda/\sigma^2$. We consider the lower bounds separately
in two cases: $0 < q < 2$ and $q = 0$.

Theorem 2 (Lower Bound: $0 < q < 2$). Let $k$ and $r$ be positive integers. Let the observed matrix
$X$ be generated by model (1) with $\sigma = 1$. Let $k_q^*$ be defined in (13). Assume that

r<^A(p + l-k* q ), (15) 

and 

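$$n h(\lambda) \ge C_0 \left( r + \log \frac{ep}{k_q^*} \right) \tag{16}$$

for some absolute constant $C_0$. Then there exists a constant $c$ depending only on $q$ and an absolute
constant $c_0$, such that the minimax risk for estimating $V$ over the parameter space $\Theta = \Theta_q(k, p, r, \lambda)$
satisfies

$$\inf_{\hat{V}} \sup_{\Sigma \in \Theta} \mathbb{E} \|\hat{V}\hat{V}' - VV'\|_F^2 \ge c \, \Psi(k_q^*, p, r, n, \lambda) \wedge c_0. \tag{17}$$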

Note that the above lower bound is obtained under the assumption (15), which implies that
$r \le s/2$. In view of (9), given a weak-$\ell_q$ radius $s$, the rank of $V$ could take values up to the maximum allowed by (9). In
particular, in the exact sparse case where $q = 0$, $r$ takes values in the full range $[s]$. An intriguing
question is what happens when the rank $r$ exceeds $s/2$. The answer turns out to be interesting: In
the sparse case, the statistical difficulty of estimating the $r$ leading singular vectors depends on
$r$ only through $r(s - r)$, which is the dimension of the Grassmannian manifold $G(s, r)$. Therefore
the dependence is not monotonic, with the worst case happening at $r = s/2$. Moreover, the minimax
rate is invariant if we replace $r$ by $s - r$. The following lower bound characterizes this
behavior precisely for the case $q = 0$.

Theorem 3 (Lower Bound: $q = 0$). Let the observed matrix $X$ be generated by model (1) with
$\sigma = 1$. Assume that $s$ and $r$ are positive integers satisfying

$$r \le p + 1 - s. \tag{18}$$

Then there exist two absolute constants $c$ and $c_0$, such that the minimax risk for estimating $V$ over
the parameter space $\Theta_0(s, p, r, \lambda)$ satisfies

$$\inf_{\hat{V}} \sup_{\Sigma \in \Theta_0(s, p, r, \lambda)} \mathbb{E} \|\hat{V}\hat{V}' - VV'\|_F^2 \ge c \, \Psi_0(s, p, r, n, \lambda) \wedge c_0. \tag{19}$$

2.2 Optimal Estimation via Aggregation 

We now show that the lower bounds given in Section 2.1 are indeed rate optimal under mild technical
conditions. The optimal estimator of $V$ is constructed using sample splitting and aggregation. The
estimator is theoretically interesting but computationally intensive. We will construct a data-driven
and easily implementable estimator in Section 3.

We first note that the loss function (3) satisfies the following

$$L(V, \hat{V}) = 2r - 2\|\hat{V}'V\|_F^2 = 2\|(I - \hat{V}\hat{V}')VV'\|_F^2. \tag{20}$$

Moreover, the loss function is invariant under orthogonal complement, i.e., $L(V, \hat{V}) = L(V^{\perp}, \hat{V}^{\perp})$,
where $[V, V^{\perp}]$ and $[\hat{V}, \hat{V}^{\perp}]$ are orthogonal matrices. Therefore the loss (20) admits the following upper
bound

$$L(V, \hat{V}) \le 2(r \wedge (p - r)). \tag{21}$$
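
The identities behind (20) and the bound (21) can be verified numerically. This is a sketch only; it assumes $V$ and $\hat{V}$ have orthonormal columns, and the helper `orth_complement` is a hypothetical name introduced here.

```python
import numpy as np

def orth_complement(V):
    """Orthonormal basis of the orthogonal complement of span(V)."""
    p, r = V.shape
    Q, _ = np.linalg.qr(V, mode="complete")   # p x p orthogonal; first r columns span V
    return Q[:, r:]

rng = np.random.default_rng(1)
p, r = 7, 5                                    # toy sizes with r > p - r
V, _ = np.linalg.qr(rng.standard_normal((p, r)))
V_hat, _ = np.linalg.qr(rng.standard_normal((p, r)))

P, P_hat = V @ V.T, V_hat @ V_hat.T
loss = np.linalg.norm(P - P_hat, "fro") ** 2
via_inner = 2 * r - 2 * np.linalg.norm(V_hat.T @ V, "fro") ** 2
via_residual = 2 * np.linalg.norm((np.eye(p) - P_hat) @ P, "fro") ** 2

# Loss between the orthogonal complements, which span (p - r)-dimensional subspaces.
Vc, Vc_hat = orth_complement(V), orth_complement(V_hat)
loss_complement = np.linalg.norm(Vc @ Vc.T - Vc_hat @ Vc_hat.T, "fro") ** 2
```

Applying the trivial bound $2r$ to whichever of the two subspaces has the smaller dimension gives the factor $r \wedge (p - r)$ in (21).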



For notational simplicity we assume that the sample size is $2n$ and we split the sample equally
according to $X = \begin{bmatrix} X^{(1)} \\ X^{(2)} \end{bmatrix}$, where $X^{(i)} = U^{(i)} D V' + Z^{(i)}$, $i = 1, 2$. Denote by $S^{(i)} = \frac{1}{n} X^{(i)\prime} X^{(i)}$
the corresponding sample covariance matrix. The main idea is to construct a family of estimators
$\{\hat{V}_B\}$ based on the first sample, indexed by the column support $B \subset [p]$, where $\hat{V}_B$ is the optimal
estimator one would use if one knew beforehand that $\mathrm{supp}(V) = B$. Then we aggregate these
estimators by selection using the second sample.

Recall the effective dimension $k_q^*$ defined in (13). For each $B \subset [p]$ such that $|B| = k_q^*$, we define
$\hat{V}_B \in O(p, r)$ as the $r$ leading singular vectors of $J_B S^{(1)} J_B$, where $J_B$ is the diagonal matrix given
by

$$(J_B)_{ii} = \mathbf{1}\{i \in B\}. \tag{22}$$

Let

$$B^* = \mathop{\mathrm{argmax}}_{B \subset [p], \; |B| = k_q^*} \mathrm{Tr}(\hat{V}_B' S^{(2)} \hat{V}_B) \tag{23}$$

and define the aggregated estimator by

$$\hat{V}^* = \hat{V}_{B^*}. \tag{24}$$

The estimator (24) requires knowledge of the value of $q$, the weak-$\ell_q$ radius $s$ and the rank
$r$. Moreover, it can be computationally intensive since in principle one needs to enumerate all
possible column supports in order to obtain $B^*$. Nonetheless, the next theorem establishes its
minimax rate optimality:
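
The construction in (22)-(24) can be sketched as a brute-force search, which also makes its combinatorial cost visible. This is only an illustration under stated assumptions: the support size `k` and the rank `r` are passed in as known, `S1` and `S2` stand for the two sample covariance matrices, and the function names are hypothetical.

```python
import itertools
import numpy as np

def top_eigvecs(S, r):
    """Eigenvectors of a symmetric matrix S for its r largest eigenvalues."""
    _, vecs = np.linalg.eigh(S)               # eigh returns eigenvalues in ascending order
    return vecs[:, ::-1][:, :r]

def aggregate_estimator(S1, S2, k, r):
    """Brute-force sketch of (22)-(24): search over all supports B of size k."""
    p = S1.shape[0]
    best_val, best_B, best_V = -np.inf, None, None
    for B in itertools.combinations(range(p), k):
        idx = np.array(B)
        V_B = np.zeros((p, r))
        # Leading vectors of J_B S1 J_B: eigenvectors of the B x B block, zero elsewhere.
        V_B[idx] = top_eigvecs(S1[np.ix_(idx, idx)], r)
        val = np.trace(V_B.T @ S2 @ V_B)      # selection criterion in (23)
        if val > best_val:
            best_val, best_B, best_V = val, B, V_B
    return best_B, best_V
```

The loop visits $\binom{p}{k}$ candidate supports, which is why this estimator is of theoretical rather than practical interest once $p$ is large.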
Theorem 4. Let $q \in [0, 2)$. Let $k_q^*$ be defined in (13). Let $\hat{V}^*$ be the aggregated estimator defined
in (24). Assume that

$$n h(\lambda) \ge C_0 k_q^* \left( r + \log \frac{ep}{k_q^*} \right), \tag{25}$$

and

$$n \ge C_0 \left( k_q^* \log \frac{ep}{k_q^*} \vee \log \lambda \right) \tag{26}$$

for some sufficiently large constant $C_0$. Then there exists an absolute constant $C$ such that for
$\Theta = \Theta_q(k, p, r, \lambda)$,

$$\sup_{\Sigma \in \Theta} \mathbb{E} \|\hat{V}^* \hat{V}^{*\prime} - VV'\|_F^2 \le C \left( r \wedge (p - r) \wedge \Psi(k_q^*, p, r, n, \lambda) \right), \tag{27}$$

where $\Psi(k, p, r, n, \lambda)$ and $k_q^*$ are defined in (11) and (13), respectively. Moreover, if $q = 0$, then $\Psi$
in (27) can be replaced by $\Psi_0$ defined in (12) and $k_0^* = s$.

The lower and upper bounds together yield the minimax rates of convergence $\Psi(k_q^*, p, r, n, \lambda)$
given in (11) and (13) with the optimal dependence on all the parameters, in particular the singular
values and the rank. The results thus provide a complete and precise characterization of the
difficulty of estimating the principal subspace in terms of the minimax rate.



In the special case of $r = 1$, a similar combinatorial procedure has been proposed in [51]. Using
Mendelson's results on empirical processes [37], this procedure is shown to attain a convergence
rate that is optimal in all parameters except for $\lambda$ [?, Theorem 2.2]. Compared with the analysis
in [51], the proof of Theorem 4 is more elementary. By exploring the structure of the difference
between the sample covariance matrix and the true covariance matrix, we obtained an upper bound
that is optimal in all parameters.

An interesting side product of the proofs of Theorems 3 and 4 is the following non-asymptotic
minimax rate for the regular PCA problem without structural assumptions on the principal
subspaces. It is a classical result (see, e.g., [?], [3]) that when $p < n$, the sample covariance matrix is
not exactly minimax optimal for estimating the whole covariance matrix under certain losses (e.g.,
the Stein loss). As shown in the next theorem, it turns out that the sample version of the principal
subspace is minimax rate optimal even in high dimensions. For more details see Theorems 8 and 9
in Sections 5.1 and 5.2.

Theorem 5. Let $\Theta = \Theta_0(p, p, r, \lambda)$. Let $n \ge C_0 \log \lambda$ for some sufficiently large constant $C_0$. Then
for all $r \in [p]$,

$$\inf_{\hat{V}} \sup_{\Sigma \in \Theta} \mathbb{E} \|\hat{V}\hat{V}' - VV'\|_F^2 \asymp r \wedge (p - r) \wedge \frac{r(p - r)}{n h(\lambda)}, \tag{28}$$

which can be attained by $\hat{V}$ formed by the $r$ leading singular vectors of the sample covariance matrix
$S$.

Theorem 5 implies that, without structural assumptions on the principal subspace $V$, consistent
estimators exist if and only if $\frac{n h(\lambda)}{r(p - r)} \to \infty$. Moreover, unless $n h(\lambda)$ exceeds a constant factor of $p$,
even the optimal estimator is within a constant factor of $r \wedge (p - r)$, the upper bound of the loss
function.
3 Adaptive Estimation

The aggregation estimator constructed in Section 2.2 has been shown to be rate optimal. However,
it depends on the unknown parameters and is computationally infeasible when $p$ is large. We
construct in this section an adaptive estimation procedure for the principal subspace which is fully
data driven and easily computable. Furthermore, it is shown that the estimator attains the optimal
rate of convergence simultaneously over a large collection of the parameter spaces defined in (6).

A key idea in our construction is a reduction scheme which reduces the sparse PCA problem
to a high-dimensional multivariate regression problem. This method is potentially applicable to
other sparsity patterns of the leading eigenvectors. We first introduce the general reduction scheme
in Section 3.1, which transforms the principal subspace estimation problem to a high-dimensional
multivariate regression problem. The specialization of this general method under the weak-$\ell_q$ constraint
will be detailed in Section 3.2.

3.1 A General Reduction Scheme 

The general reduction scheme involves four steps, which are introduced in order below. The
procedures used in Steps 2 and 4 for initial and final estimation will be specified in Section 3.2 for
weak-$\ell_q$ constrained parameter spaces. For ease of exposition, we regard the rank $r$ as given in the
statement below. Data-based choice of $r$ will be discussed at the end of Section 3.2.

Step 1: Sample generation. Given $X$ in (1) with $\sigma = 1$, we generate an $n \times p$ random matrix $\tilde{Z}$
with iid $N(0, 1)$ entries which are independent of $U$ and $Z$, and form two samples $X^i = X + (-1)^i \tilde{Z}$,
$i = 0, 1$. Let $Z^i = Z + (-1)^i \tilde{Z}$ for $i = 0, 1$; then $Z^0$ and $Z^1$ are independent, and their entries are
iid $N(0, 2)$ distributed. Then, the two samples $X^0$ and $X^1$ can be equivalently written as

$$X^i = UDV' + Z^i, \quad i = 0, 1. \tag{29}$$

Let $S^i = \frac{1}{n}(X^i)'X^i$, $i = 0, 1$, be the sample covariance matrices for the two samples.

Step 2: Initial estimation. We use the sample $X^0$ to compute an initial estimator $\hat{V}^0$. A specific
procedure for computing the initial estimator $\hat{V}^0$ will be given in Section 3.2.

Step 3: Reduction to regression. Form

$$(X^1)'X^0\hat{V}^0 = VA + (Z^1)'B, \tag{30}$$

where $B = X^0\hat{V}^0 = UDV'\hat{V}^0 + Z^0\hat{V}^0$, and $A = DU'B$.

Note that conditioning on $U$ and $Z^0$, both $VA$ and $B$ are fixed. Hence, (30) becomes a regression
problem, with additive noise matrix $(Z^1)'B$ with normal entries. However, since $B$ does not have
orthonormal columns, the noises are not iid.

To deal with this issue, we introduce a further "whitening" step as follows. Note that $B = X^0\hat{V}^0$
is observed. Thus, we can compute its singular value decomposition as $B = LCR'$. Post-multiplying
both sides of (30) by $\frac{1}{\sqrt{2}} R C^{-1}$, we obtain

$$Y = \Theta + E, \tag{31}$$

where

$$Y = \frac{1}{\sqrt{2}} (X^1)'X^0\hat{V}^0 R C^{-1}, \qquad \Theta = \frac{1}{\sqrt{2}} V A R C^{-1}, \qquad E = \frac{1}{\sqrt{2}} (Z^1)'L. \tag{32}$$

Note that $Z^1$ has iid $N(0, 2)$ entries, and that $L \in \mathbb{R}^{n \times r}$ has orthonormal columns. Since $Z^1$ is
independent of both $U$ and $Z^0$, it is also independent of $B$ and $L$. Hence $E \in \mathbb{R}^{p \times r}$ has iid $N(0, 1)$
entries and is independent of $\Theta$. Conditioning on $U$ and $Z^0$, the matrix $\Theta$ in (31) is fixed. Therefore
(31) becomes a standard multivariate regression problem with orthogonal design and white noise.

Step 4: Final estimation. In the final step, we find an estimator $\hat{\Theta}$ for $\Theta$ under model (31)
by treating it as a regression problem, and obtain the estimator $\hat{V}$ for $V$ by orthonormalizing the
columns of $\hat{\Theta}$. The orthonormalization can be completed by the Gram-Schmidt procedure or QR
factorization.

An important feature of the above reduction scheme is that the two samples $X^0$ and $X^1$ share
the same realization of random factors $U$ and their only difference is in the noise matrices $Z^0$ and
$Z^1$. This is critical for maintaining the right level of signal-to-noise ratio in the regression problem
(31) when conditioning on $U$ and $Z^0$. In contrast, splitting the original sample into two halves as
in Section 2.2 does not achieve this goal here. Since our analysis relies on the independence of $Z^0$
and $Z^1$, the normality of the noise is crucial to this scheme.
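
The four steps can be traced on simulated data, where $U$, $V$ and $Z$ are known and the decomposition $Y = \Theta + E$ can be checked directly. The sketch below is not the procedure analysed in the paper: as placeholders, Step 2 uses plain PCA of $X^0$ and Step 4 simply orthonormalises $Y$, with no sparsity penalty. All sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 50, 20, 2                            # hypothetical toy sizes
lam = np.array([9.0, 4.0])

# Model (1) with sigma = 1 and a row-sparse V supported on the first four rows.
V = np.zeros((p, r))
V[:4] = np.linalg.qr(rng.standard_normal((4, r)))[0]
U = rng.standard_normal((n, r))
D = np.diag(np.sqrt(lam))
Z = rng.standard_normal((n, p))
X = U @ D @ V.T + Z

# Step 1: two samples sharing U, with independent N(0, 2) noise matrices.
Z_tilde = rng.standard_normal((n, p))
X0, X1 = X + Z_tilde, X - Z_tilde
Z0, Z1 = Z + Z_tilde, Z - Z_tilde

# Step 2: placeholder initial estimator (plain PCA of X0, not the paper's choice).
V0 = np.linalg.svd(X0, full_matrices=False)[2][:r].T

# Step 3: regression form (30) and whitening via the SVD B = L C R'.
B = X0 @ V0                                    # n x r, observed
L, C, Rt = np.linalg.svd(B, full_matrices=False)
Y = X1.T @ L / np.sqrt(2)                      # equals (X1)' X0 V0 R C^{-1} / sqrt(2)
Theta = V @ D @ U.T @ L / np.sqrt(2)           # signal part, fixed given U and Z0
E = Z1.T @ L / np.sqrt(2)                      # noise part with iid N(0, 1) entries

# Step 4: placeholder final estimator: orthonormalise Y by QR factorization.
V_hat, _ = np.linalg.qr(Y)
```

Since $B R C^{-1} = L$, the whitened response reduces to $Y = \frac{1}{\sqrt{2}} (X^1)'L$, so only the left singular vectors of $B$ are needed in practice.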

3.2 Sparse PCA and Regression with Group Sparsity 

We now apply the general reduction scheme to the principal subspace estimation problem under
parameter spaces (6). In what follows, we first introduce the specific estimators for both the initial
and the final estimation steps. Then, we show that the general reduction scheme paired with the
two specific estimators leads to a final estimator which adaptively achieves the optimal rates of
estimation over a large collection of the parameter spaces of interest. For clarity of exposition, we
regard the rank $r$ as given when introducing the estimators. Data-driven choice of $r$ is discussed
at the end of this subsection.

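Initial Estimation Let $p_n = p \vee n$. We construct the initial estimator $\hat{V}^0$ via the diagonal
thresholding method [25] as follows:

1. Define the set of features

$$J = \left\{ j : s_{jj}^0 \ge 2\left(1 + \alpha \sqrt{\log p_n / n}\right) \right\}, \tag{33}$$

where $\{s_{jj}^0\}_{j=1}^p$ are the diagonal elements of $S^0 = \frac{1}{n}(X^0)'X^0$.
2. Compute the first $r$ eigenvectors of the submatrix $S_{JJ}^0$: $\hat{v}_1, \dots, \hat{v}_r$.
3. Define $\hat{V}^0 \in O(p, r)$, where

$$\hat{V}^0_{J*} = [\hat{v}_1, \dots, \hat{v}_r], \qquad \hat{V}^0_{J^c *} = 0. \tag{34}$$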
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The following result gives sufficient conditions to guarantee that the initial estimator V° is 
reasonably close to V, which suffices for the initialization of our scheme. 
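The three steps of the diagonal thresholding initializer can be sketched as below. This is an illustrative sketch rather than a reference implementation: the function name, the error handling when fewer than $r$ coordinates are selected, and the choice of `alpha` in the usage are assumptions.

```python
import numpy as np


def diagonal_thresholding(X0, r, alpha):
    """Initial estimator V0 from the sample X0 (noise variance two).

    Step 1: keep coordinates whose sample variance exceeds the threshold.
    Step 2: take the r leading eigenvectors of the selected submatrix.
    Step 3: embed them into R^p by zero padding outside J.
    """
    n, p = X0.shape
    p_n = max(p, n)
    S0 = X0.T @ X0 / n
    threshold = 2.0 * (1.0 + alpha * np.sqrt(np.log(p_n) / n))
    J = np.flatnonzero(np.diag(S0) >= threshold)
    if J.size < r:
        raise ValueError("fewer than r coordinates selected; V0 is undefined")
    # eigh returns eigenvalues in ascending order, so reverse the columns.
    _, eigvecs = np.linalg.eigh(S0[np.ix_(J, J)])
    leading = eigvecs[:, ::-1][:, :r]
    V0 = np.zeros((p, r))
    V0[J, :] = leading
    return V0, J
```

The returned `V0` has orthonormal columns and is supported on the rows in `J`, matching (34).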

Proposition 1. Suppose for sufficiently large constants $M_0, C_0 > 0$,

$\log n \ge M_0 \log \lambda$, (35)

K2s( iog^y/^ s(2 _ 9)8/2/ ^ (36) 



If $V^0$ is defined in (34) with a sufficiently large $\alpha \ge \sqrt{10(1 + 1/M_0)}$ in (33), then uniformly over $\Theta = \Theta_q(s,p,r,\lambda)$, we have that

$|\mathrm{supp}(V^0)| \le k^*$ and $\sigma_r(V'V^0) \ge 1/2$ (37)

hold with probability at least $1 - C/[nh(\lambda)]$, where $k^* = k^*(s,p,r,\lambda,n)$ is defined in (13).

Remark 2. When $M_0$ in (35) is unknown, we could replace it by

$\hat M = \log n / \log(\sigma_1(S^0) - 2)$, (38)

where $\sigma_1(S^0)$ is the largest eigenvalue of $S^0$. This estimate works because $\sigma_1(S^0) - 2$ over-estimates $\lambda$ with high probability [41], [38], since the noise variance here is two. The estimator (38) allows us to choose $\alpha$ in (33) without explicit knowledge of $M_0$.

Final Estimation: Orthogonal Regression with Group Sparsity  In this step, we always regard the matrix $\Theta$ in (31) as fixed. Hence (31) is indeed a regression model with iid $N(0,1)$ noise. When the sparsity of $V$ is specified as in (6), we need to consider the following parameter space for $\Theta$:

$\mathcal{F}_q(s',p) = \{\Theta : \|\Theta\|_{q,w} \le s'\}$, (39)

with $q \in [0,2)$. The parameter $s'$ is typically different from $s$ in (6), depending on the other model parameters as well as the realization of $U$ and $Z^0$. However, this will not cause any difficulty in practice, because the estimator we propose below and the associated theorem remain valid for all values of $s' > 0$. Moreover, $s'$ can be controlled with high probability.

In the literature of high-dimensional regression, (39) is usually referred to as the group sparsity constraint on the regression coefficients $\Theta$. In this setup, we propose the following method for computing $\hat\Theta$. Define

$t_k = r + \sqrt{2r\beta\log\frac{ep}{k}} + \beta\log\frac{ep}{k}$, (40)

and

$\mathrm{pen}(\Theta) = \mathrm{pen}(|\mathrm{supp}(\Theta)|)$, where $\mathrm{pen}(k) = (1+\delta)^2\sum_{i=1}^{k} t_i$, (41)

where $\delta \in (0,1)$ is a small constant. Then the estimator for $\Theta$ is defined as

$\hat\Theta = \arg\min_{\Theta} \|Y - \Theta\|_F^2 + \mathrm{pen}(\Theta)$. (42)

Such a penalized least squares approach has been widely used in orthogonal regression with various choices of the penalty functions. See, for example, Birgé and Massart and Abramovich et al.

Remark 3. The penalized least squares estimator in (42) is easily computable. Recall (31) and write the matrix $Y$ by rows, $Y = [y_1, \dots, y_p]'$. Let $y_{(i)}$ denote the row in $Y$ with the $i$-th largest $\ell_2$ norm, i.e., $\|y_{(1)}\| \ge \|y_{(2)}\| \ge \cdots \ge \|y_{(p)}\|$, and define

$\hat k = \arg\min_{k \in [p]} \{ \sum_{i=1}^{k}(1+\delta)^2 t_i + \sum_{i=k+1}^{p}\|y_{(i)}\|^2 \}$.

It is clear that $\hat k$ is easy to compute. Then the estimator is given by $\hat\Theta = [\hat\theta_1, \dots, \hat\theta_p]'$ with

$\hat\theta_i = y_i 1\{\|y_i\| \ge \|y_{(\hat k)}\|\}$.
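The computation described in Remark 3 can be sketched as follows. This is a minimal sketch: the function name and the default values of `beta` and `delta` are illustrative assumptions (the text only requires $\beta > 2$ and $\delta \in (0,1)$), and ties in the row norms are broken by the sort order rather than by the indicator.

```python
import numpy as np


def penalized_group_estimator(Y, beta=2.5, delta=0.1):
    """Row-wise hard thresholding that minimizes ||Y - Theta||_F^2 + pen(Theta).

    Keeps the k_hat rows of Y with largest Euclidean norm and zeroes the rest,
    where k_hat minimizes pen(k) plus the squared norms of the dropped rows.
    """
    p, r = Y.shape
    k = np.arange(1, p + 1)
    log_term = np.log(np.e * p / k)
    t = r + np.sqrt(2.0 * r * beta * log_term) + beta * log_term  # t_k in (40)
    penalty = (1.0 + delta) ** 2 * np.cumsum(t)                   # pen(k) in (41)
    sq_norms = np.sum(Y ** 2, axis=1)
    order = np.argsort(-sq_norms)              # rows by decreasing norm
    sorted_sq = sq_norms[order]
    tail = sorted_sq.sum() - np.cumsum(sorted_sq)   # sum over i > k
    k_hat = int(np.argmin(penalty + tail)) + 1      # k ranges over 1..p
    theta_hat = np.zeros_like(Y)
    keep = order[:k_hat]
    theta_hat[keep] = Y[keep]
    return theta_hat, k_hat
```

The cost is dominated by sorting the $p$ row norms, so the estimator is computable in $O(pr + p\log p)$ time.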



For the estimator in (42), we have the following upper bound on its risk.
Theorem 6. Consider the regression problem

$Y = \Theta + E$,

where $\Theta$ is the $p \times r$ matrix of regression coefficients of interest and $E$ has iid $N(0,1)$ entries. Let the parameter space $\mathcal{F}_q(s',p)$ be defined in (39) for some $q \in [0,2)$ and $s' > 0$. If $\beta > 2$ in (40), then there is an absolute constant $C > 0$, such that the estimator $\hat\Theta$ in (42) satisfies

$\sup_{\Theta \in \mathcal{F}_q(s',p)} E\|\hat\Theta - \Theta\|_F^2 \le Ck'(r + \log\frac{ep}{k'})$,

where

$k' = \min\{k : t_k^{q/2} k \ge s'\}$. (43)
If the set in (43) is empty, then $k' = p$.
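The effective sparsity $k'$ in (43) is a simple one-dimensional search. The sketch below is illustrative only; the function name and the default `beta` are assumptions.

```python
import numpy as np


def k_prime(s_prime, p, r, q, beta=2.5):
    """Smallest k in 1..p with t_k^(q/2) * k >= s_prime; p if no such k exists."""
    k = np.arange(1, p + 1)
    log_term = np.log(np.e * p / k)
    t = r + np.sqrt(2.0 * r * beta * log_term) + beta * log_term  # t_k in (40)
    feasible = np.flatnonzero(t ** (q / 2.0) * k >= s_prime)
    return int(feasible[0]) + 1 if feasible.size else p
```

For $q = 0$ the condition reduces to $k \ge s'$, so $k'$ is the smallest integer at least $s'$ (capped at $p$).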



By the lower bounds in [35], the rates in Theorem 6 are optimal.

Adaptation  With the above preparation, we are now ready to show that if we start with a proper initial estimator $V^0$ (such as that in (34)) and estimate $\Theta$ by (42), then the estimator $\hat V$ resulting from orthonormalizing the columns of $\hat\Theta$ achieves the optimal rates of convergence. We state the theorem in a slightly more general format. In particular, it holds for the initial estimator in (34) under conditions (35) and (36).

Theorem 7 (Adaptation). Let $\lambda > C_0$ for some sufficiently large constant $C_0$. For any $\Theta = \Theta_q(s,p,r,\lambda)$ such that the conditions in Theorem 4 hold and that an initial estimator $V^0$ satisfying (37) exists, the estimator $\hat V$ obtained by orthonormalizing $\hat\Theta$ in (42) with $\beta > 2$ in (40) satisfies

$\sup_{\Sigma \in \Theta} E\|\hat V\hat V' - VV'\|_F^2 \le C(r \wedge (p - r) \wedge \Psi(k^*, p, r, n, \lambda))$,

where $C > 0$ is an absolute constant and $k^*$ is given in (13).

We note that the assumption $\lambda > C_0$ is imposed to ensure that the "whitening" procedure in Step 3 of the reduction scheme can be performed.

It is interesting to compare the statement of Theorem 7 to the minimax lower bound in Theorems 2-3 as well as the performance of the combinatorial aggregation estimator $V_*$ established in Theorem 4. For any parameter space $\Theta = \Theta_q(s,p,r,\lambda)$ such that the conditions (35) and (36) hold, we could use the $V^0$ in (34), and the resulting $\hat V$ is guaranteed to achieve the optimal rates of convergence on $\Theta$, which matches the performance of the aggregation estimator for any $q > 0$. Moreover, in this case both $V^0$ and $\hat V$ can be efficiently computed. Hence $\hat V$ can be used in practice while $V_*$ is computationally intensive. However, in the exact sparse case of $q = 0$, the upper bound in Theorem 7 depends on the rank $r$ linearly through $sr$, while the true minimax rate in Theorem 3 depends on $r$ quadratically through $r(s - r)$, which is much smaller than $rs$ if $s - r$ is small. The suboptimality of $\hat V$ in this specific regime is partially due to the fact that our reduction scheme transforms the problem into a regression problem without taking account of the orthogonality structure of the parameter space.

Remark 4. Theorem 7 also shows that any estimator $V^0$ satisfying (37) can be used to produce an adaptive estimator. Therefore, the task of constructing adaptive optimal estimators is reduced to constructing a "reasonable" estimator.

Consistent Estimator of r  Last but not least, we discuss how to construct a consistent estimator of $r$ based on data. To this end, recall the definition of the set $J$ in (33) and the matrix $S^0_{JJ}$. We propose to estimate $r$ by

$\hat r = \max\{l : \sigma_l(S^0_{JJ}) > 2(1 + \delta_{|J|})\}$, (44)
where for any m > and Mq in (j35[) . we define 

with = ^((m+1) log(ep) + (l+2/Mo) log re). Here, we regard Mq in ([35]) as known. Otherwise, we 
could always replace it with the estimator (38) proposed in Remark 2. Note that the estimator (44) could be easily integrated with the diagonal thresholding method for computing $V^0$. In particular, $\hat r$ can be computed after we select the set $J$ in (33).
For this estimator, we have the following result.

Proposition 2. Under the condition of Proposition 1, $\hat r = r$ holds with probability at least $1 - C[nh(\lambda)]^{-1}$.

Under conditions (35)-(36) and those in Theorem 7, Proposition 2 implies that the conclusion in Theorem 7 still holds if we replace $r$ by $\hat r$.

4 Discussions 

We have focused in the present paper on the estimation of the principal subspace span($V$) under the loss (3). The minimax rates of convergence are established and a computationally efficient adaptive estimator is constructed. A problem closely related to principal subspace estimation is the estimation of the whole covariance matrix $\Sigma$ under the same structural assumption (6). In this case it is more natural to use the spectral norm as the loss function $L(\Sigma, \hat\Sigma) = \|\hat\Sigma - \Sigma\|^2$. Both minimax estimation and adaptive estimation are of significant interest. Another relevant question is whether a plug-in estimator of the type $\hat\Sigma = \hat V\hat\Lambda\hat V' + \hat\sigma^2 I_p$, where $\hat V$ is the adaptive estimate of $V$ given in Section 3 and $\hat\Lambda$ and $\hat\sigma^2$ are some estimates of $\Lambda$ and $\sigma^2$ respectively, can be rate optimal under the spectral norm loss.

It is interesting to extend the aggregation method in Section 2.2 to other settings beyond sparsity or weak $\ell_q$ constraints. In the exact sparse case ($q = 0$), note that the rate-optimal estimator in (27) is constructed by choosing the best estimator from a collection of estimators, each of which is designed for a specific sparsity pattern. Theorem 4 can now be interpreted as an oracle inequality for the average risk, which is within a constant factor of the oracle risk plus the excess risk

nfe ^ log (j!) . One immediate generalization of Theorem [3] is that we can also construct aggregated 
estimators if it is known that the true principal subspace belongs to a collection of N subspaces.
Then the excess risk does not exceed log N. It is much more challenging to obtain an oracle 
inequality with the constant one, which implies an upper bound on the minimax regret. The current 
aggregation method based on the equal sample splitting is, however, not sufficient to achieve this 
goal. 

It should be noted that our analysis in this paper relies on the normality assumption. In particular, the adaptive procedure requires the independence of $Z^0$ and $Z^1$, which is a consequence of the normality of the noise. It is unclear whether the same results hold for all noise distributions with sub-Gaussian tails. It is an interesting problem to study the robustness of the adaptive procedure and to extend the results to other noise distributions.

The adaptation procedure proposed in the current paper shows that sparse PCA is connected to the Gaussian sequence model. Moreover, the optimal rates for the sparse PCA problem derived in the present paper coincide with those for the regression problem in [35] under a proper scaling. Thus, an intriguing theoretical question is whether certain forms of the two problems are indeed asymptotically equivalent to each other in Le Cam's sense [33] under appropriate conditions. Such an asymptotic equivalence result would enable deeper understanding of the sparse PCA problem, and guide the development of other adaptive estimation procedures by borrowing the insights from the regression problem.

5 Proofs 

In this section we prove Theorems 3, 4 and 7. The proofs of the other results, together with those of the key lemmas and some additional technical arguments, are given in the appendix.

5.1 Proof of Theorem 3



We first give a lower bound on the oracle risk where we know beforehand the row support of $V$. This corresponds to a $k$-dimensional unstructured PCA problem, where the goal is to estimate the $r$ leading singular vectors of the covariance matrix.

Theorem 8 (Oracle risk: lower bound). Let $\Theta = \Theta_0(k, k, r, \lambda)$ where $\lambda > 0$. Then

$\inf_{\hat V}\sup_{\Sigma \in \Theta} E\|\hat V\hat V' - VV'\|_F^2 \ge c( r \wedge (k - r) \wedge \frac{r(k-r)}{nh(\lambda)} )$, (45)

where $c$ is an absolute constant.



To prove Theorem 8, we use a minimax lower bound due to Yang and Barron [54, Section 7] via local metric entropy, which in turn relies on an argument by Birgé [7]. The situation here is slightly different from that in [54] in the sense that we use the global covering number instead of the packing number to derive bounds on the local packing number. For completeness, we state the result in Proposition 3 and provide a short proof in Section 6.6. The method of local metric entropy in a $\frac{1}{\sqrt{n}}$-neighborhood dates back to Le Cam [32]. The advantage of this method is that it only relies on the analytical behavior of the metric entropy of the parameter space, thus allowing us to sidestep constructing an explicit packing set in the parameter space.

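Proposition 3. Let $(\Theta, d)$ be a totally bounded metric space and $\{P_\theta : \theta \in \Theta\}$ a collection of probability measures. For any $E \subset \Theta$, denote by $\mathcal{N}(E,\epsilon)$ the $\epsilon$-covering number of $E$, i.e., the minimal number of balls of radius $\epsilon$ whose union contains $E$. Denote by $\mathcal{M}(E,\epsilon)$ the $\epsilon$-packing number of $E$, i.e., the maximal number of points in $E$ whose minimum pairwise distance is at least $\epsilon$.

Put

$A = \sup_{\theta \ne \theta'} \frac{D(P_\theta \| P_{\theta'})}{d^2(\theta, \theta')}$. (46)

If there exist $0 < c_0 < c_1 < \infty$ and $d \ge 1$ such that

$(c_0/\epsilon)^d \le \mathcal{N}(\Theta,\epsilon) \le \mathcal{M}(\Theta,\epsilon) \le (c_1/\epsilon)^d$ (47)

for all $0 < \epsilon < \epsilon_0$, then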



infsu P E,[(i 2 (^X),0)] > (4 A£ o 1 ( lS) 



We also need the following result regarding the metric entropy of the Grassmannian manifold $G(k,r)$ due to Szarek.

Lemma 1. For any $V \in O(k,r)$, identifying the subspace span($V$) with its projection matrix $VV'$, define the metric on $G(k,r)$ by $d(VV', UU') = \|VV' - UU'\|_F$. Then for any $\epsilon \in (0, \sqrt{2[r \wedge (p - r)]}]$,

$(c_0/\epsilon)^{r(k-r)} \le \mathcal{N}(O(k,r), \epsilon) \le (c_1/\epsilon)^{r(k-r)}$, (49)

where $c_0, c_1$ are absolute constants. Moreover, for any $V \in O(k,r)$ and any $\alpha \in (0,1)$,

$\mathcal{M}(B(V,\epsilon), \alpha\epsilon) \ge ( \frac{c_0}{\alpha c_1} )^{r(k-r)}$. (50)



Proof. Note that d(W',UU') = V2\\(I - VV')VV'||Fj_in view of ([2D]). This metric is unitarily 
invariant (see p' a in [48j, Remark 5, p. 175]). Applying [43, Proposition 8, p. 169] with a(-) = ||-|| 
gives ([4*9]) . By the proof of (|125j) . for any e E (0, y/2r A (p — r)] and any a E (0, 1), there exists 
V* E 0{k,r) such that M{B(V*, e), ate) > {-^) r{k ' r) ■ Now for any V E 0(k,r), there exists a 
T E 0(p), such that V = TV*. Then (150j) holds since the metric d is unitarily invariant. □ 

Proof of Theorem 8. For the purpose of the lower bound, we consider the special case of $\lambda_1 = \cdots = \lambda_r = \lambda$, i.e., $\Sigma = \lambda VV' + I_k$. A simple calculation of the Kullback-Leibler divergence yields

D(N(0, AW + l k ) n || JV(0, AUU' + l k ) n ) = nh{\)\\ W - UU'|||. (51) 

In view of ([46 p . we have A = nh{\). Applying Proposition [3] with eo = \/r(k — r) yields the desired 

m. □ 



Proof of Theorem^ Let G = Qo(k,p, r, A). By definition (|13p . coincides with s. Under the 
assumption of Theorem [31 our goal is to prove the following non-asymptotic lower bound: if r < 
p + 1 — s, then 

inf sup E||W - W||| > c( 1 A — \— ( r(s - r) + s log —)) . (52) 
v see V nh(X) V s J J 

where c is an absolute constant. It is sufficient to prove the following inequalities separately: 

inf sup E||W - W||| > cr A (s - r) A ^ S ~^ (53) 
v see nh{\) 

and 

inf sup E||W - W||| > 1 A -4^tt log — . (54) 
v Eee nh{\) s 



The inequality (53) follows from an oracle argument: Consider the following sub-collection

$\{ V = \begin{bmatrix} V_1 \\ 0 \end{bmatrix} : V_1 \in O(s,r) \}$.

Split the data matrix according to $X = [X_1, X_2]$, where $X_1$ consists of the first $s$ columns. Let $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_r)$. Then the rows of $X_1$ and $X_2$ are iid according to $N(0, V_1\Lambda V_1' + I_s)$ and $N(0, I_{p-s})$, respectively. Since $X_1$ and $X_2$ are independent and the law of $X_2$ does not depend on $V$, a sufficient statistic for estimating $V$ is $X_1$. This reduces the problem to an $s$-dimensional unconstrained PCA problem. Applying the lower bound in Theorem 8 yields (53).



The inequality (|54p follows from the existing result of rank-one estimation (e.g., 0, [sij] ) . To 
make the argument rigorous, we focus on the special case where {v2, . . . , v r } are fixed to be standard 
basis. Denote the following sub-collection 



V 



vi 
Ir_i 



Vl G SP-Msupp( Vl )| < ^ , (55) 



which is well-defined since we have assumed that s < p — r — I'm Theorem [2j Let Xi denote the first 
p — r + 1 columns of X. Restricted on the subset (|55p , the estimation error of V is lower bounded by 
that of estimation error of vi based on Xi. This is equivalent to replacing the ambient dimension 
p by p — r + 1 and estimating only the leading singular vector vi under the loss ||viv^ — viVjJ| F . 
Applying the minimax lower bound in 0, Theorem 2], we have 

. t-wCrCrl xrtr/112 ^ ck i e(p - r + 1) c' k ep 

mf sup E VV - W F > — — log '- > — — log 

v see nn{\) s nn{\) s 

where we have used r < § implied by assumption (|18p . The proof of Theorem [3] is now completed. 

□ 

5.2 Proof of Theorem 4

We first state a few technical lemmas and an oracle upper bound. Some of the proofs are relegated 
to the appendix. 

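Lemma 2. Let $a, b, c > 0$. Then $ax^2 \le bx + c$ implies that $x^2 \le \frac{b^2}{a^2} + \frac{2c}{a}$.

Proof. Since $|x - \frac{b}{2a}| \le \frac{\sqrt{b^2 + 4ac}}{2a}$, the inequality $(u + v)^2 \le 2u^2 + 2v^2$ gives $x^2 \le \frac{b^2 + (b^2 + 4ac)}{2a^2} = \frac{b^2}{a^2} + \frac{2c}{a}$. □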
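Lemma 3. Let $\Sigma = I_p + V\Lambda V'$ with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_r)$. For any $T \in O(p, r)$, we have

$\frac{\lambda_r}{2}\|VV' - TT'\|_F^2 \le \langle \Sigma, VV' - TT' \rangle \le \frac{\lambda_1}{2}\|VV' - TT'\|_F^2$. (56)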



Lemma 4. Let $K \in \mathbb{R}^{p \times p}$ be symmetric such that $\mathrm{Tr}(K) = 0$ and $\|K\|_F = 1$. Let $Z$ be $n \times p$ consisting of independent standard normal entries. Then for any $t > 0$, we have

$P\{ \frac{1}{\sqrt{n}}|\langle Z'Z, K \rangle| \ge 2t + \frac{2t^2}{\sqrt{n}} \} \le 2\exp(-t^2)$. (57)



Lemma 5. Let $X_1, \dots, X_N$ be i.i.d. such that

$P\{|X_i| \ge at + bt^2\} \le c\exp(-t^2)$, (58)

where $a, b, c > 0$. Then

$E\max_{i \in [N]} |X_i|^2 \le (2a^2 + 8b^2)\log(ecN) + 2b^2\log^2(cN)$. (59)

Lemma 6. Let $E$ be a symmetric positive definite matrix. Let $F$ be a symmetric matrix. Then

$|\langle E, F \rangle| \le \|F\| \mathrm{Tr}(E)$. (60)

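Lemma 7. Let $\Theta \in \mathcal{F}_q(s,p)$ and $k \in [p]$. Let $\|\Theta_{(i)}\|$ denote the $i$-th largest row norm. Then

$\sum_{i>k}\|\Theta_{(i)}\|^2 \le \frac{q}{2-q}k(s/k)^{2/q}$. (61)

Proof. By the definition of $\mathcal{F}_q(s,p)$ in (39), we have

$\sum_{i>k}\|\Theta_{(i)}\|^2 \le \sum_{i>k}(s/i)^{2/q} \le s^{2/q}\int_k^\infty x^{-2/q}\,dx = \frac{q}{2-q}k(s/k)^{2/q}$. □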
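The bound in Lemma 7 can be checked numerically on the extremal sequence $\|\Theta_{(i)}\| = (s/i)^{1/q}$, which saturates the weak-$\ell_q$ constraint. This is a sanity check under that assumption, not part of the proof; the function name is hypothetical and $q \in (0,2)$ is required.

```python
import numpy as np


def lemma7_tail_and_bound(s, q, k, p):
    """Tail sum of squared row norms beyond k for the extremal weak-l_q profile,
    together with the bound q / (2 - q) * k * (s / k)^(2 / q) from (61)."""
    i = np.arange(k + 1, p + 1)
    tail = float(np.sum((s / i) ** (2.0 / q)))
    bound = q / (2.0 - q) * k * (s / k) ** (2.0 / q)
    return tail, bound
```

Since $x \mapsto x^{-2/q}$ is decreasing and integrable for $q < 2$, the tail sum is dominated by the integral from $k$ to infinity, which is exactly the stated bound.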



Theorem 9 (Oracle risk: upper bound). Let $p = k$ and $r \in [k]$. Let $n \ge C_0\log\lambda$ for some sufficiently large constant $C_0$. Let $\hat V \in O(p,r)$ be formed by the $r$ leading singular vectors of $S$. Let $\Theta = \Theta_0(k,k,r,\lambda,\kappa)$. Then

$\sup_{\Sigma \in \Theta} E\|\hat V\hat V' - VV'\|_F^2 \le C( r \wedge \frac{r(k-r)}{nh(\lambda)} )$. (62)

Proof of Theorem 4. Before delving into the details, we give an outline of the proof as follows:

1. We find a good sparse approximation of the true singular vectors which lies in the weak-$\ell_q$ ball defined by (39).

2. We decompose the risk into a summation of three terms, namely the approximation error, oracle risk and excess risk, the first two of which are upper bounded in Lemma 7 and Theorem 9, respectively.

3. The excess risk is controlled by a careful concentration-of-measure analysis, which forms the core of the proof.

Step 1: Sparse approximation. Fix $V \in O(p,r) \cap \mathcal{F}_q(s,p)$. We assume that $q > 0$. Note that this step is superfluous if $q = 0$ since $V$ is already sparse. Let $k = k^*$ be defined in (13). Let $\mathcal{B}(k) = \{B \subset [p] : |B| = k\}$. Let $A \in \mathcal{B}(k)$ denote the collection of row indices of $V$ corresponding to the $k$ largest row norms. Put

$\bar\Sigma = J_A\Sigma J_A + J_{A^c} = J_A V\Lambda V' J_A + I_p$, (63)

where $J_A$ is the diagonal matrix defined in (22). Denote the SVD of $J_A V\Lambda V' J_A$ by $\bar V\bar\Lambda\bar V'$, where $\bar\Lambda = \mathrm{diag}(\bar\lambda_1, \dots, \bar\lambda_r)$ and $\bar V \in O(p,r) \cap \mathcal{F}_q(s,p)$, since supp($\bar V$) $= A$. Now we claim that $\bar V$ is in fact the $r$ leading singular vectors of $\bar\Sigma$. To this end, note that the singular values of $\bar\Sigma$ are $\{1 + \bar\lambda_1, \dots, 1 + \bar\lambda_r, 1\}$. In view of (63), it is sufficient to show that the $r$-th largest singular value of $\bar\Sigma$ is separated from one, i.e., $\sigma_r(\bar\Sigma) > 1$. This follows from Weyl's theorem [20, Theorem 4.3.1]:

$\sigma_r(\bar\Sigma) \ge \sigma_r(\Sigma) - \|\bar\Sigma - \Sigma\| \ge 1 + \lambda_r - \|\bar\Sigma - \Sigma\|_F$.

Put $U = J_A V$. Then

$\|\bar\Sigma - \Sigma\|_F = \|V\Lambda V' - U\Lambda U'\|_F$
$\le \|(V - U)\Lambda V'\|_F + \|U\Lambda(V - U)'\|_F$
$\le 2\lambda_1\|V - U\|_F$
$\le 2\lambda_1\sqrt{\frac{q}{2-q}k(s/k)^{2/q}}$ (64)
$\le 2\lambda_1\sqrt{\frac{q}{2-q}\Psi(k,p,r,n,\lambda)}$ (65)
$\le \frac{\lambda_r}{2}$, (66)

where (64) follows from applying Lemma 7, (65) follows from the choice of $k = k^*$ in (13) and (66) is implied by the assumption (25). Therefore

$\sigma_r(\bar\Sigma) \ge 1 + \frac{\lambda_r}{2}$. (67)

Since we have verified that $\bar V$ indeed corresponds to the $r$ leading singular vectors of $\bar\Sigma$, we obtain the SVD of (63) as

$\bar\Sigma = \bar V\bar\Lambda\bar V' + I_p$. (68)

Using Theorem 10 we show that $\bar V$ provides a good sparse approximation of $V$:

$\|\bar V\bar V' - VV'\|_F^2 \le \frac{2\|\bar\Sigma - \Sigma\|_F^2}{(\sigma_r(\bar\Sigma) - 1)^2} \le \frac{32q\kappa^2}{2-q}\Psi(s,p,r,n,\lambda)$, (69)

where the last inequality follows from (64) and (67). If $q = 0$, then we define $\bar V = V$.

Step 2: Risk decomposition. By definition of the maximizer B* in (|23p . (S( 2 ), V aV' a — V*V^ < 
0. In view of Lemma [3l we have 



■^r II 
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l|2 
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vv' - 


vv' 


> + 
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-VaV'a) 
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> 




W - 


w 


> + 
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E - S( 2 ), VaV a - 


v*v;) 
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W - 
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> + 




vv' 


-VaV a ) 




E - S( 2 ), V A V' A - 


v*v;) 


Ai 
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W - 


vv' 
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Ai 
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vv' 


-VaV'a 
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(s-s (2) ,v A v A 


-v*v; 



(70) 
(71) 



approximation error 



oracle risk 



excess risk 
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where (|70p follows from that supp(V) = suppCV^) = A, and (|7ip follows from Lemma O 

Note that the expected oracle risk is upper bounded by Theorem 9. The sparse approximation error can be upper bounded by (69). Moreover, in the exact sparse case ($q = 0$), we have $\bar V = V$ and the approximation error is zero.

Step 3: Excess risk. The hard part is to control the third term (the worst-case fluctuation) in (71). To this end, we decompose the sample covariance matrix as

S (2) = ~X' (2) X (2) = ~(VDU' (2) + Z' (2) )(U (2) DV + Z {2) ). 



Then 



n 



£ — S( 2 ) = G + H, 



(72) 



where 



G = VDQu' (2) U (2) -I r ) DV 



1 



H = I„ — -Z'^Zm - -VDU' (2) Z (2) - -Z' (2) U (2) DV. 



1»- --(2)^(2) n 



1 r rl 



(73) 
(74) 



We first deal the inner product with G: Write {G,V A V' A - V*V'^ = {G,V A V' A - W 
G.V.V; - VV'Y Note that 

G, W - V A V' A ) = /d ( -V' (2 )Vf2) ~ D,V'(W - V A V' A )V 



ii 



n 



D ( -U' (2) U (2) - I r ) D, I r - V'V A V' A Y 



< 



D[ -U' (2) U (2) - I r | D 



< 



A, 



n 

-U' (2) U (2) -I r 



Tr(I r - V'V^V) 



IW - v A v' 112 



AIIF 



(75) 
(76) 



where (f60l) is due to (f20l) and (|75l) is a consequence of Lemma El in view of the fact that I r — 
V'V^V^V is symmetric positive semi-definite while D(^U| 2 ^U( 2 ) — I r )D is symmetric. Similarly, 
we have 



g.v,:v: - W) < ^ 



Combining (|76|) and (|77|) . we arrive at 



-U( 2) U (2) -I r 



G,V A V^-V,V;\ < 2Xi 



n 



U (2) U (2) ~ l r 



|VV- %v% 



IVaV^-v.v;!! 2 



(77) 



(78) 



Next we control the inner product with H: Recall that A = supp(V) is fixed. We define a 
collection of p x p symmetric matrices indexed by B £ B(k) as follows: 



K 



B 



V A V^ - V B V' B (V A V' A - VbV'b), 

F 



(79) 
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which has zero trace and unit Frobenius norm. Recall that V* = Vb». Then 

'n,v A v' A - v*v;) = ||VaV / a-v,v / j f (h,Kb.) 

< || V A V' A - V*V'J F max |(H,K B > 

BeB(k) 



(80) 



Assembling (j72j) . (|78j) and (|80|) . we can upper bound the excess risk by 
^-S {2) ,V A V' A -V,V^ 

'g,v a v' a - v.vt) + (h, v { v' A - v v ; 

^ U (2) U (2) - If 



< 2Ai 



V A V' A - V*V'jf + T \\V A V' A - V,V'J| F 



(81) 



Now we combine the risk decomposition (|7ip with the upper bounds above to control the risk 
of our aggregated estimator V*: To simplify notation, denote 

5 = ||V*v; - VV'Hf, A = ||VV - VV'Hf, 



n 



(2) u (2) - A r 



i?= ||VV- VaV^If, M 
Assembling ([7T]) and (fSTj) . we have 

^ _ 6A1MJ <5 2 < T5 + (A 2 + R 2 ) (Jj + 6A X M^ + T(i2 + A). 



(82) 



Introduce the event E = {M < oi^}- By the assumption (|26j) . r < c"n for a sufficiently small 
constant. Then there exists a constant d > only depending on /c, such that > 2(v^ + t) + 
(\/~ + t) 2 , where t = wi2Ii£^MM, Applying Proposition [J] yields 



(83) 



Conditioning on the event E and using Lemma [51 we have 

r2 32T 2 3Ai (A 2 + R 2 ) + 4T(R + A) 

< — 7T! 1 



A 2 . 



A r 



(84) 



Recall from ()20|) that the loss function is upper bounded by r A (p — r). Taking expectation on 
both sides of (fMj) and using ([83]) together with Cauchy-Schwartz inequality, we have 



Ell V.Vl - VV 



32 ET 



* » * 

2 



/||2 
F 



4E[T(i? + A)] 



< + 3k(A 2 + Ei? 2 ) + — — ^— v ^ 1 ~ JS + r P {E°} 

<^ + (3 K + 8)(A 2 + E^) + -^. 



(85) 
(86) 
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In view of the oracle upper bound in Theorem [9l we have 

(k — r)r 



ER < C [ r A (k — r)- 



nh(\) 



(87) 



By (f69|) . if q > 0, the approximation is upper bounded by 



2 ^ SZQk 2 



A < 



^(s,p, r, n, A). 



2-g 

If q = 0, then A = 0. To control the right-hand side of ([86]) . it boils down to upper bound ET 2 . In 
the sequel we shall prove that 



ET^COL + AO-log^ 



n k 

for some absolutely constant C. Plugging (f8"T|) . (f88"j) and (fBUj) into (f8"6"j) . we arrive at 

e||v,v; - vv'ii 2 , 

C k , ep 2>2qn 2 T . , . (k — r)r r 

< — — -log-f- + — *(s,p,r,ri, A) + r A , , , N + 



(89) 



/i(A)n 2-g 
< C'^(s,p,r,n, A), 



n/i(A) c'ra/i(A) 



(90) 
(91) 



where the constant C" only depends on At. In the special case of q = 0, the approximation error is 
A = 0, which implies that the second term in (|9U|) is zero. Hence we have the following stronger 
result 



EIIV.V' -W'\\l< ° k '~~ eP 



(k — r)r r 
- log -f- + r A r-rrk + 



F - /i(A) k ' ' " nfc(A) ' c'n/i(A) 
< C'y (s,p,r,n,\) 



(92) 



where v&o is defined in f)12[) . Then (|9ip and (|92p implies the statement of the theorem for q > 
and q = respectively. 

To finish the proof of the theorem, it remains to establish (|89p . To this end, recall that Kb is 
symmetric and Tr(K^) = 0. By definitions of T and H in (|80p and (|74p respectively, we have 



T < Ti + 2T 2 , 



where we define 



rr-i A -1 

i2 = — max 

n BeB(k) 



rr A 1 

ll = — max 



VDU' (2) Z (2) ,K B 



Z (2) Z (2), K B 



— max 



z; 2) u (2) dv',k 



We shall prove that 



24k , 32fc 2 , 2 ep 62 

ET 2 < l og ^ + — log 2 ^ + -. 

n k n z '■ 



k n 



, 40k , ep 24k 2 2 ep 103 17k\ 

ETi < Ai logf + — log 2 -f + + — . 

1 n k n z k n n z J 



(93) 

(94) 
(95) 

(96) 
(97) 
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Assembling ([93]) with ([96]) - ([95]) and using the fact that (a + b) 2 < 2(a 2 + b 2 ), we arrive at 

ET 2 ^ + gET 2 

< 15 „o (1 + Al) ^ loE | + ^v|) (98) 

< 3000(1 + Ai)- log (99) 
n k 

where we used ^ log | < 1 implied by the assumption ([26]) . 

It then remains to establish (96)-(97). Note that the collection $\{K_B : B \in \mathcal{B}(k)\}$ belongs to the $\sigma$-algebra generated by the first sample $X_{(1)}$, which is independent of $(Z_{(2)}, U_{(2)})$. By conditioning on $X_{(1)}$, we can treat $\{K_B : B \in \mathcal{B}(k)\}$ as fixed matrices.

Proof of [96]) : For each fixed -B 6 B(k), Kg 1 Z( 2 )- Applying Lemma H] we have 



n 



[Z'Z,K B )\ >2i + 



2t 2 



< 2exp(-t 2 ). 



Applying Lemma[5]with N = \B(k)\ = (£) < {^) k , a = 2,6 = -4= and c = 2, we have 

ET 2 < - ( 81og(2eA0 + -(log 2 (2A0 + 2 log(2eiV)) ] 



n 

24 8 
— log(2eA0 + ^log 2 (2A0, 



n- 



(100) 
(101) 



which implies ([96]) . 

Proof of (9l\): Fix S G B(k). Since U( 2 ) Z( 2 ), conditioned on the realization of U( 2 ), 

(VDU' (2) Z (2) ,K B ) = (KbVDU' (2) ,Z' (2)( 
is distributed according to N(0, ||KbVDU^ |||) . Therefore 



VDU' Z^K, 



(d) 



'(2)^(2)^5 

for some W ~ N(0, 1) independent of U( 2 ). 

Using the fact that ||AB|| F < ||A|| F ||B||, we have 

K B VDU' (2) < ||K B || F ||V||||D| 



K B VDV' ( 



(2) 



w 



U', 



(2) 



< \/Ai U 



(2) I 



Consequently, (VDU'^Z( 2 ),Kb) is stochastically dominated by y/Xi ||U( 2 )|| \W\. Since U( 2 ) is an 
n x r standard Gaussian matrix, Lemma [9] yields 



P{||U (2 )|| > Vn + Vf + t} < exp , t>0. 

Applying the union bound yields 

P{||U ( 2)|| \W\ > y/2(rfi + Vr)t + 2l?\ 

< P{(||U (2) || - Vn- y/r)\W\ >2t 2 } + p{|VF| > V2t\ 

< P{||U (2 )|| > \/n + Vr + V2t} + 2P||VF| > V2t} 

< 3exp(-t 2 ), 



(102) 
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which the last inequality follows from (|102p and the Chernoff bound P {W > \/2t} < ^exp(— t). 
Therefore 

( /vDTT'. . 7,,^ Ko\ 

> \/2( V« + y/r)t + 2t 2 I < 3exp(-t 2 ). 



VDU' (2) Z (2) ,K B 
Applying Lemma [5] with N = (?) yields 



ET| < ((8 + ^) 2 )log(3eiV) + 21og 2 (3iV)) 



which, in view of that r < k, implies the desired (j97l) . □ 
5.3 Proof of Theorem 7

Proof. We prove the theorem in three steps. First, we verify that the "whitening" procedure in Step 3 of the general reduction scheme can be performed. Next, we investigate the signal-to-noise ratio of the regression problem. Finally, we derive the desired rates by using Theorem 6 and Wedin's sin-theta theorem [52].

1° As a first step, we verify that the "whitening" step is indeed possible, which requires that $\sigma_r(B) > 0$. To this end, let $J = \mathrm{supp}(V^0)$. Then $|J| \le k^*$ by (37). Since $B = UDV'V^0 + Z^0V^0$,

$\sigma_r(B) \ge \sigma_r(UDV'V^0) - \sigma_1(Z^0V^0) \ge \sigma_r(U)\sigma_r(D)\sigma_r(V'V^0) - \sigma_1(Z^0_J)$. (103)

By Lemma [9] and ()37|) . with probability at least 1 — C/[nh(X)], 



MU)>v^(l-^-V^^)> ^(V'V°)>i. (104) 

Note that assumption (|26p implies that n > C$r. Thus we could further lower bound o~ r (U) by 
C\fn. Together with cr r (D) = \f\ r , the first term in f)103j) is thus lower bounded by C\Jn\ r with 
probability at least 1 — C/[nh(X)}. 

Turning to the second term in f)103j) . we first note that it is upper bounded by max/ C [ p ] |7-| =fc . || Z° || . 
Note that for any t > 0, we have 

P <^ max ||Z?|| > Jn+ \fk* + t } 
\lc\p],\i\=K V 



£ P{||Z?|| > V^+Jk*+t}< ( P \eM~t 2 m 



< 

icw,|j|=*; 



Upon choosing t = t* = ^J2k*log(ep/k*) + -y/2 log[n/i(A)], assumptions ([25]) and ([26]) together 
imply that 

ax(Z°j) <V^+,fk* + Vt* < y/E + Cy/^X. (105) 
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with probability at least 1 — C/[nh(X)]. 

Under the assumption that $\lambda_r > C_0$, (104) and (105) lead to $\sigma_r(B) \ge c\sqrt{n\lambda_r} > 0$ with probability at least $1 - C/[nh(\lambda)]$. This completes the first step in the proof.

2° Let A = ^ARCT 1 = ^DU'BRC 1 = ^jDU'L. Then = VA holds in (ETJ) . In the 

second step, we show that there exist two constants C2 > C\ > depending only on k, such that 
with probability at least 1 — C/[nh(X)], 

CiVnA < o>(A) < 0-1 (A) < C 2 \fnX. (106) 

To this end, note that (|lU4p and assumption (J2SD imply 

MA) > >(DK(U) >^-f-- /IMS!!) > 

holds with probability at least 1 — C/[nh(X)]. Under the same assumption, Lemma [9] implies 



, l( A, < >(D W U) < (l + f- + ^/^ffl) < c^X. 

Thus, (fT06|) is established. 

3° Next we show that, conditioned on the event that (106) holds, the signal matrix $\Theta$ lies in $\mathcal{F}_q(s',p)$ where

$s' \le s\,\sigma_1^q(A) \le Cs(n\lambda)^{q/2}$. (107)

To see this, note that (fT07]) trivially holds if = 0. For g G (0,2), define U q {s,p) = {0 G M pxr : 

ELl < s l- Fix <?' G (<7> 2 )- 0ne can verif y that Kq(s,p) C Jg(s 5 p) C H q i{s q '/ q ,p) (see, 

e.g., [24]). Therefore, for any V G J- q (s,p) and any matrix A, we have VA G 1-L q >(s q ' 9 ||A|| 9 ,p) C 
T q t{s q ' l q || AH 9 ',]?). Sending (/ i 5 yields VA C .Fg(s|| AH 9 , p), which implies the first inequality in 
(|107p . The second inequality follows from (|106p . 

Comparing the definitions of $k^*$ in (13) and $k'$ in (43), we obtain that $k' \le Ck^*$ whenever (107) holds, where the constant $C$ depends only on $\kappa$ when $\lambda \ge C_0$.
Let $E$ denote the event that (106) holds. Then



E||VV' - VV'Hl = E||VV' - VV'|||l {B} + E||VV' - W'\\ll {E c } 

Cr 



< EIIVV'- W||2.1/ E i + 



Here, the last inequality holds because the loss function is upper bounded by $r$ and $P(E^c) \le C/[nh(\lambda)]$.

To further bound the first term on the right-hand side, we note that $E$ is completely determined by $U$ and $Z^0$. Hence, it is non-random conditioned on $U$ and $Z^0$. Thus,



E||VV' - VV'\\ll {E} < 2E l - 

C 



10 



©III. 



a? (A) 



1 {E} 



<4eii© 

raA 



eiili 



P-L{B} 



raA 
raA 



10- 0II&1 



u,z 



fc' (r + log ^) 1 



-{£} 



< 



Here, the first inequality comes from (20) and Wedin's sin-theta theorem for SVD [52]. The second inequality comes from (106). The second last inequality comes from Theorem 6. The last inequality holds because on the event $E$, $k' \le Ck^*$ and $k(r + \log(ep/k))$ is increasing in $k$. We complete the proof by noting that $1/\lambda \le C/h(\lambda)$ holds since $\lambda \ge C_0$. The bound $C[r \wedge (p - r)]$ on the risk always holds since it comes from the upper bound on the loss function as discussed in (21). □



6 Appendix 

6.1 Weak-$\ell_q$ constraint for orthogonal matrices

In this appendix we prove Q. To see this, note that the row norm of any V £ 0(p,r) never 
exceeds one, which implies that ||V|| 3)IU < p. On the other hand, for any V G G q (s,p), due to 
the weak £ q constraint, the ordered row norm satisfies \\Vy]*\\ < Therefore r = ||V||| < 

Ej=i 1 A (j) 2/9 < s + s 2/q ^2 j>s r 2/q < where the last inequality follows from LemmaE This 
completes the proof of ©. 
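The chain of inequalities in this appendix can be checked numerically. The sketch below is a hypothetical illustration, not part of the original argument: it evaluates $\sum_{j=1}^{p} 1 \wedge (s/j)^{2/q}$, the intermediate expression $s + s^{2/q}\sum_{j>s} j^{-2/q}$, and the closed form $2s/(2-q)$. The closed form is an assumption of this sketch, obtained by comparing the tail sum with an integral for integer $s$; the function names are illustrative.

```python
def truncated_row_norm_sum(s, q, p):
    """Sum over j = 1..p of min(1, (s / j) ** (2 / q))."""
    return sum(min(1.0, (s / j) ** (2.0 / q)) for j in range(1, p + 1))


def split_bound(s, q, p):
    """s + s^(2/q) * sum_{s < j <= p} j^(-2/q), for integer s."""
    tail = sum(j ** (-2.0 / q) for j in range(s + 1, p + 1))
    return s + s ** (2.0 / q) * tail


def closed_form_bound(s, q):
    """2 s / (2 - q); assumption of this sketch (integral comparison, integer s)."""
    return 2.0 * s / (2.0 - q)


if __name__ == "__main__":
    for s in (1, 5, 20):
        for q in (0.5, 1.0, 1.5):
            print(s, q, truncated_row_norm_sum(s, q, 2000),
                  split_bound(s, q, 2000), closed_form_bound(s, q))
```

For integer $s \le p$ the first two quantities coincide up to rounding, since the first $s$ terms of the sum equal one.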

6.2 Proof of Theorem 5

Proof. Let $q \in (0, 2)$ and $\Theta = \Theta_q$. Set $k = k^*$ as defined in (15). Similar to the proof of Theorem 3, it is sufficient to prove the following lower bounds separately:

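$$ \inf_{\hat V}\sup_{\Sigma\in\Theta} \mathbb{E}\|\hat V\hat V' - VV'\|_F^2 \ge c\,\frac{rk}{nh(\lambda)} \tag{108} $$

and

$$ \inf_{\hat V}\sup_{\Sigma\in\Theta} \mathbb{E}\|\hat V\hat V' - VV'\|_F^2 \ge c\,\frac{k}{nh(\lambda)}\log\frac{ep}{k} \tag{109} $$

for some constant $c$.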

The main idea of proving (108) is to embed the worst-case configuration for the exact $k$-sparse case into $\mathcal{F}_q(s,p) \cap O(p,r)$. Although this collection of matrices is not explicitly constructed, we can still control their weak $\ell_q$-norm by choosing their center appropriately. By assumption (15), we have

$$ k \ge s \ge 2r. \tag{110} $$

Put

$$ V_0 = \begin{bmatrix} I_r \\ 0 \end{bmatrix} \in O(k,r). \tag{111} $$



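Let $\alpha \in (0, 1)$ be the constant determined by $c_0, c_1$, the absolute constants from Lemma 1. Let

$$ \varepsilon^2 = \frac{rk}{2nh(\lambda)} \wedge r \wedge (k - r). \tag{112} $$

By (50) in Lemma 1 there exists $\{V_1, \dots, V_m\} \subset O(k,r)$, where $m \ge 2^{r(k-r)}$, such that

$$ \min_{1\le i<j\le m} \|V_iV_i' - V_jV_j'\|_F \ge \alpha\varepsilon, \tag{113} $$

$$ \max_{i\in[m]} \|V_iV_i' - V_0V_0'\|_F \le \varepsilon. \tag{114} $$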



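Now we augment the dimension of the $V_i$'s by adding zero rows: Set

$$ \tilde V_i = \begin{bmatrix} V_i \\ 0 \end{bmatrix} \in \mathbb{R}^{p\times r} \quad \text{for all } 0 \le i \le m. $$

Then $\tilde V_i \in \mathcal{G}_0(k,p)$. Moreover, both (113) and (114) hold with $V_i$ replaced by $\tilde V_i$.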

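Next we show that as a consequence of the choice of $V_0$ in (111), $\{\tilde V_1, \dots, \tilde V_m\}$ are in fact contained in $\mathcal{G}_q(s,p)$. These will be the finite collection of points for applying Fano's lemma. To verify this, fix $V \in \{\tilde V_1, \dots, \tilde V_m\}$. Since $\|VV' - \tilde V_0\tilde V_0'\|_F^2 \le \varepsilon^2$, the ordered row norm of $V$ satisfies

$$\begin{aligned}
\sqrt{1 - \varepsilon^2} \le \|V_{(i)*}\| &\le 1, & i &= 1, \dots, r, \\
\|V_{(i)*}\| &\le \frac{\varepsilon}{\sqrt{i - r}}, & i &= r + 1, \dots, k, \\
\|V_{(i)*}\| &= 0, & i &= k + 1, \dots, p.
\end{aligned}$$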



In view of assumption (15), $r \le s/2$. Therefore $i\|V_{(i)*}\|^q \le 2r \le s$ for all $i \in [2r]$. If $2r < i \le k$, using the definition of $\varepsilon$ in (112), we have

$$\begin{aligned}
i\|V_{(i)*}\|^q &\le i\Big(\frac{\varepsilon^2}{i - r}\Big)^{q/2}
\le 2^{q/2} k^{1-q/2}\Big(\frac{rk}{2nh(\lambda)}\Big)^{q/2} \\
&= k\Big(\frac{r}{nh(\lambda)}\Big)^{q/2}
\le k\Big(\frac{r + \log\frac{ep}{k}}{nh(\lambda)}\Big)^{q/2}
\le s,
\end{aligned}$$

where the last inequality follows from the definition of $k^*$ in (13) and (14). Therefore we have $\|V\|_{q,w} \le s$. The desired lower bound $\varepsilon^2$ then follows from the same application of Fano's lemma as in the proof of Proposition 3.

It remains to establish the lower bound (109). Since (109) is weaker than the already proved lower bound (108) if $r \ge \log\frac{ep}{k}$, we assume in the sequel that $r < \log\frac{ep}{k}$. Using the same rank-one sub-collection (55) as in the proof of Theorem 3, we can use the lower bound for the rank-one case [Theorem 2 of the cited reference]:



•■> 



inf sup E||VV' - Will > s 
v see ' \ nh WJ 

£s (4£> (n5) 

k ep , 

where (115) follows from $k \ge s$ in (110) and (116) is due to the definition of $k = k^*$ in (13). □



6.3 Proof of Proposition 1



Proof. The proof follows from calculations similar to those in [25].



1° Let $a_\pm$ be two constants such that $0 < a_- < 1 < a_+$. Define the sets

$$ J_\pm = \Big\{ j : \sum_{i=1}^{r} \lambda_i v_{ji}^2 \ge 2 a_\mp \alpha\sqrt{\log p_n / n} \Big\}. $$

We are to show that for a sufficiently large value of $\alpha$ in (33), $J_- \subset J \subset J_+$ holds with probability at least $1 - C/[nh(\lambda)]$.
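Note that $s_{jj} \sim \sigma_j^2\chi_n^2/n$, where $\sigma_j^2 = 2 + \sum_{i=1}^{r} \lambda_i v_{ji}^2$. Consider the event $\{J_- \subset J\}$. We have

$$\begin{aligned}
P\{J_- \not\subset J\}
&= P\Big\{\bigcup_{j\in J_-}\big\{s_{jj} < 2\big(1 + \alpha\sqrt{\log p_n/n}\big)\big\}\Big\} \\
&\le \sum_{j\in J_-} P\big\{s_{jj} < 2\big(1 + \alpha\sqrt{\log p_n/n}\big)\big\} \\
&\le \sum_{j\in J_-} P\Big\{\frac{s_{jj}}{\sigma_j^2} < \frac{2\big(1 + \alpha\sqrt{\log p_n/n}\big)}{2\big(1 + a_+\alpha\sqrt{\log p_n/n}\big)}\Big\} \\
&\le \sum_{j\in J_-} P\Big\{\frac{\chi_n^2}{n} - 1 < -\frac{(a_+ - 1)\alpha\sqrt{\log p_n/n}}{1 + a_+\alpha\sqrt{\log p_n/n}}\Big\} \\
&\le p_n^{\,1 - (a_+ - 1)^2\alpha^2(1 - o(1))/4} \le C/[nh(\lambda)].
\end{aligned}$$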

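Here, the second last inequality comes from Lemma 8 and the last inequality holds under condition (35) and for sufficiently large values of $\alpha \ge \sqrt{10(1 + 1/M_0)}$. On the other hand, we have

$$\begin{aligned}
P\{J \not\subset J_+\}
&= P\Big\{\bigcup_{j\in J_+^c}\big\{s_{jj} \ge 2\big(1 + \alpha\sqrt{\log p_n/n}\big)\big\}\Big\} \\
&\le \sum_{j\in J_+^c} P\Big\{\frac{s_{jj}}{\sigma_j^2} \ge \frac{2\big(1 + \alpha\sqrt{\log p_n/n}\big)}{2\big(1 + a_-\alpha\sqrt{\log p_n/n}\big)}\Big\} \\
&\le \sum_{j\in J_+^c} P\Big\{\frac{\chi_n^2}{n} - 1 \ge \frac{(1 - a_-)\alpha\sqrt{\log p_n/n}}{1 + a_-\alpha\sqrt{\log p_n/n}}\Big\} \\
&\le p_n\,\frac{\sqrt{2}\big(1 + a_-\alpha\sqrt{\log p_n/n}\big)}{(1 - a_-)\alpha\sqrt{\log p_n}}\exp\Big\{-\frac{(1 - a_-)^2\alpha^2\log p_n}{4\big(1 + a_-\alpha\sqrt{\log p_n/n}\big)^2}\Big\} \\
&\le p_n^{\,1 - (1 - a_-)^2\alpha^2(1 - o(1))/4} \le C/[nh(\lambda)].
\end{aligned}$$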

Here, the third last inequality comes from Lemma 8 and the last inequality holds under (35) for a sufficiently large value of $\alpha \ge \sqrt{10(1 + 1/M_0)}$. From the above bounds, if we choose $a_\pm$ properly, $J_- \subset J \subset J_+$ holds with probability at least $1 - C/[nh(\lambda)]$.

Note that for any $j \in J_+$, $\|V_{j*}\|^2 \ge C\sqrt{\log p_n / (n\lambda^2)}$. By the definition of the parameter space in (6), we have with probability at least $1 - C/[nh(\lambda)]$,



|J|s|J+l£Cs ( A /sS;)" 2£fc «' 

where $k^*$ is defined in (13). Here the last inequality holds under conditions (35) and (36). It also
depends on s satisfying Q, which is always the case. This completes the proof of the first claim in 

(37).



2° To prove the second claim in (37), we first bound $\|V_{J_-^c}\|_F$. In this proof, for any $A \subset [p]$, we use $V_A$ to denote the $p \times r$ matrix whose rows in $A$ are the same as those of $V$ and whose rows in $A^c$ are all zeros. Denote the $j$th largest row norm of $V$ by $\|V_{(j)*}\|$. Then we have



|V^lll< £ l|Vbl*H 2 Aa-W^ (H7) 

3>\J-\ 







tJ "V nX 2 



< min (t c a.aJ 1 ^ + J—a^ 2 ^' 2 ) (118) 
~ t c >o V V nX 2 2 - 9 c / v ; 

< C[(2-g)a_a] q/ 

< e 2 K~ 2 . (119) 

Here, the last inequality holds under condition (36), where $\varepsilon$ is a sufficiently small constant depending on $M_1$. This implies that $\|V_{J_-^c}\| < 1$ and so $V_{J_-}$ has full column rank $r$.
Next, we show that on the event $J_- \subset J \subset J_+$,

$$ \sigma_r(V_J\Lambda V_J') \ge \lambda_r/2. $$

To this end, note that $\lambda_r = \sigma_r(V\Lambda V')$. Hence, $\sigma_r(V_J\Lambda V_J') \ge \lambda_r - \|V\Lambda V' - V_J\Lambda V_J'\|$. By an argument similar to the one leading to (66),

$$ \|V\Lambda V' - V_J\Lambda V_J'\| \le 2\lambda_1\|V_{J^c}\|_F \le 2\lambda_1\|V_{J_-^c}\|_F \le 2\lambda_r\varepsilon. $$

Here, the second inequality holds when $J_- \subset J$, and the last inequality comes from (119).
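The first inequality in the last display can be sanity-checked numerically. The following is a minimal sketch under stated assumptions: `V` is a random matrix with orthonormal columns, the index set `J` is a hypothetical selection of the rows with largest norms (not the data-driven set of the paper), and all function names are illustrative.

```python
import numpy as np


def zero_rows_outside(V, J):
    """Return V_J: the rows of V indexed by J are kept, all other rows are zero."""
    VJ = np.zeros_like(V)
    VJ[J, :] = V[J, :]
    return VJ


def perturbation_gap(p=30, r=3, keep=12, seed=0):
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((p, r)))      # orthonormal columns
    lam = np.sort(rng.uniform(1.0, 5.0, size=r))[::-1]     # lambda_1 >= ... >= lambda_r
    Lam = np.diag(lam)
    J = np.argsort(-np.linalg.norm(V, axis=1))[:keep]      # hypothetical selected rows
    VJ = zero_rows_outside(V, J)
    lhs = np.linalg.norm(V @ Lam @ V.T - VJ @ Lam @ VJ.T, 2)   # spectral norm
    rhs = 2.0 * lam[0] * np.linalg.norm(V - VJ, "fro")         # 2 * lambda_1 * ||V_{J^c}||_F
    return lhs, rhs


if __name__ == "__main__":
    print(perturbation_gap())
```

The bound follows from writing the difference as $V_{J^c}\Lambda V' + V_J\Lambda V_{J^c}'$ and using $\|V\| = 1$, $\|V_J\| \le 1$.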

We are now ready to derive the lower bound for $\sigma_r(V'V^\circ)$. To this end, for any matrix $A$, let $P_A$ denote the projection matrix onto the column space of $A$. We first note that

1 - a r 2 (W°) = ||l r - (V^'VV'V )! 

= || (I r - W)V (V°)'|| = ||P V - P. 



v°l 



(120) 



< UPv-PvjII + llPVj-Pvol 

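To bound the first term, we apply Theorem 10 to obtain that

$$ \|P_V - P_{V_J}\| \le \frac{\|V_J\Lambda V_J' - V\Lambda V'\|}{\sigma_r(V_J\Lambda V_J')} \le \frac{2\lambda_1\|V_{J^c}\|_F}{\lambda_r/2} \le \frac{4\lambda_1\varepsilon}{\kappa\lambda_r} \le 4\varepsilon. $$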
To bound the second term, we first note that $\mathbb{E}(S^\circ) = V\Lambda V' + 2I_p =: \Sigma^\circ$. Then we have

$$ \sigma_r(\Sigma^\circ_{JJ}) = 2 + \sigma_r(V_J\Lambda V_J') \ge 2 + \lambda_r/2. $$

Following the lines of the proof in Section 6.7, we can show that



:i2ii 



(122) 



°V+i(S jj) < 2 + A r 



Thus, Theorem 10 implies

$$ \|P_{V_J} - P_{V^\circ}\| \le \frac{\|S^\circ_{JJ} - \Sigma^\circ_{JJ}\|}{\sigma_r(\Sigma^\circ_{JJ}) - \sigma_{r+1}(S^\circ_{JJ})}. $$



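Moreover, we note that $S^\circ_{JJ} = \frac{1}{n}\big[V_JDU'UDV_J' + (Z^\circ_J)'Z^\circ_J + V_JDU'Z^\circ_J + (Z^\circ_J)'UDV_J'\big]$, and $\Sigma^\circ_{JJ} = V_J\Lambda V_J' + 2I_{JJ}$. Thus, the triangle inequality leads to

$$ \|S^\circ_{JJ} - \Sigma^\circ_{JJ}\| \le \lambda_1\Big\|\frac{1}{n}U'U - I_r\Big\| + \Big\|\frac{1}{n}(Z^\circ_J)'Z^\circ_J - 2I_{JJ}\Big\| + \frac{2\sqrt{\lambda_1}}{n}\|U'Z^\circ_J\|. $$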

By Proposition 4 and Proposition 5, with probability at least $1 - C/[nh(\lambda)]$, for $t = \sqrt{(2/n)\log[nh(\lambda)]}$,

$$ \Big\|\frac{1}{n}U'U - I_r\Big\| \le 2\big(\sqrt{r/n} + t\big) + \big(\sqrt{r/n} + t\big)^2, $$



(Z°j)'Z°j - 21 



./,/ 



< 



z J+ yz°j -2i J+t 



■n 



< 41 



J+l +t) + 2(i /|X! - 



71 



n 



2 , 



Iu'ZjII < 



u'z J+ 



V n V n 
On the event such that the above bounds hold, conditions (35) and (36) imply that

$$ \|S^\circ_{JJ} - \Sigma^\circ_{JJ}\| \le C\lambda_1\varepsilon/\kappa \le \lambda_r/4, \tag{123} $$

and hence

$$ \|P_{V_J} - P_{V^\circ}\| \le C\varepsilon. \tag{124} $$

Combining (120)–(124), we obtain that $\sigma_r(V'V^\circ) \ge \sqrt{1 - C\varepsilon} \ge 1/2$. This completes the proof. □

6.4 Proof of Proposition 2

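Proof. 1° We first show that $\hat r \le r$ with probability at least $1 - C/[nh(\lambda)]$. To this end, note that

$$\begin{aligned}
P\{\hat r > r\}
&\le P\big\{\sigma_{r+1}(S^\circ_{JJ}) > 2(1 + \delta_{|J|})\big\} \\
&\le P\Big\{\max_{|A| = |J|}\sigma_{r+1}(S^\circ_{AA}) > 2(1 + \delta_{|J|})\Big\} \\
&\le \sum_{k=r+1}^{p} P\Big\{\max_{|A| = k}\sigma_{r+1}(S^\circ_{AA}) > 2(1 + \delta_k),\ |J| = k\Big\} \\
&\le \sum_{k=r+1}^{p} P\Big\{\max_{|A| = k}\sigma_{r+1}(S^\circ_{AA}) > 2(1 + \delta_k)\Big\}.
\end{aligned}$$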

By Proposition 1.2 of [22] and the interlacing property of eigenvalues, we have

$$ \frac{1}{2}\max_{|A| = k}\sigma_{r+1}(S^\circ_{AA}) \overset{st}{\le} \max_{|A| = k - r}\sigma_1(W_{AA}) \le \max_{|A| = k}\sigma_1(W_{AA}). $$

Here, $nW \sim W_p(n, I_p)$, i.e., the standard Wishart distribution, and $\overset{st}{\le}$ means "stochastically smaller". Thus, we obtain

$$ P\{\hat r > r\} \le \sum_{k=r+1}^{p} P\Big\{\max_{|A| = k}\sigma_1(W_{AA}) > 1 + \delta_k\Big\}. $$

For each summand on the right side, we have that, for $nW_k \sim W_k(n, I_k)$,

P (max di(W AA ) > 1 + S k \ < ff) P {ai(W fe ) > 1 + 5 k } 



M\=k J 



A; / pnh(X) 

Here, the first inequality is the union bound, while the second inequality comes from Proposition 4 and the fact that $\binom{p}{k} \le (ep/k)^k$. This last inequality holds under (35) with the specific choice of $t_k$ used in the definition of $\delta_k$ in (44). Summing over all possible $k$'s, we obtain that $\hat r \le r$ with probability at least $1 - C/[nh(\lambda)]$.

2° Next we show that $\hat r \ge r$ with probability at least $1 - C/[nh(\lambda)]$. To this end, we note that (122) and (123) imply that with probability at least $1 - C/[nh(\lambda)]$,

$$ \sigma_r(S^\circ_{JJ}) \ge \sigma_r(\Sigma^\circ_{JJ}) - \|S^\circ_{JJ} - \Sigma^\circ_{JJ}\| \ge 2 + \lambda_r/4. $$

Note that under conditions (35) and (36), on the event that $J_- \subset J \subset J_+$, $\lambda_r/4 > \delta_{|J|}$. This completes the proof since the proof of Proposition 1 shows that $J_- \subset J \subset J_+$ holds with probability at least $1 - C/[nh(\lambda)]$. □



6.5 Proof of Theorem 6

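Proof. The proof essentially follows the classical argument in the Gaussian sequence model.
Let $K(M, \Theta) = \|\Theta - M\|_F^2 + \mathrm{pen}(M)$ and

$$ \Theta_0 = \operatorname*{argmin}_{M} K(M, \Theta). $$

Then we have

$$ \mathbb{E}\|\hat\Theta - \Theta\|_F^2 \le K(\Theta_0, \Theta) + 2\,\mathbb{E}\langle\hat\Theta - \Theta, E\rangle, $$

since

$$\begin{aligned}
\|\hat\Theta - \Theta\|_F^2 &= \|Y - \hat\Theta\|_F^2 + 2\langle\hat\Theta - \Theta, E\rangle - \|E\|_F^2, \\
\|Y - \hat\Theta\|_F^2 + \mathrm{pen}(\hat\Theta) &\le \|Y - \Theta_0\|_F^2 + \mathrm{pen}(\Theta_0).
\end{aligned}$$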
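The first identity above is purely algebraic and holds for any candidate $\hat\Theta$ once $Y = \Theta + E$. A minimal numerical sketch (illustrative names, random inputs, no optimality of the candidate assumed):

```python
import numpy as np


def decomposition_gap(p=10, r=3, seed=0):
    """Compare ||That - Theta||_F^2 with ||Y - That||_F^2 + 2<That - Theta, E> - ||E||_F^2."""
    rng = np.random.default_rng(seed)
    Theta = rng.standard_normal((p, r))        # signal matrix
    E = rng.standard_normal((p, r))            # noise matrix
    Y = Theta + E                              # observation
    Theta_hat = rng.standard_normal((p, r))    # arbitrary candidate estimate
    lhs = np.linalg.norm(Theta_hat - Theta, "fro") ** 2
    rhs = (np.linalg.norm(Y - Theta_hat, "fro") ** 2
           + 2.0 * np.sum((Theta_hat - Theta) * E)   # trace inner product
           - np.linalg.norm(E, "fro") ** 2)
    return lhs, rhs


if __name__ == "__main__":
    print(decomposition_gap())
```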

1° Let $\|\Theta\|_{(i)}$ denote the $i$th largest row norm of $\Theta$. To bound $K(\Theta_0, \Theta)$, we note that for any $\Theta \in \mathcal{F}_q(s', p)$, applying Lemma 7 yields

K(@ ,®) = sup mf[||0- M||| + pen(M)] 

n 

= mf[^ ||0||+pen(Z)] 

i=l+l 

n 

< \\0\\^+pen(k')<^k'(s'/k') 2 ^+pen(k') 
i=k'+l " 

< -^—k'ty + pen(fc') < ^—k' (r + log ^) . 
2 — q 2 — q \ k' J 

2° To bound E(0 - 0, E) , we first note that 

p 

i=i 

For any i, if ||0j|| < t 1 < t 2 , then (e i; yjl{|| yj ||2 >tl} - Gi) > {e h yil{|| yi ||2>t 2 } - Gi). This is 
because 

(ei,yil{|| y .||2 >tl } - 0i) - (ei,yil{|| yi ||2 >t2 j. - Gi) 
= < e * J yi 1 {||y 1 [|36(ti,t a ]}) 

= (llyill 2 - ^yi) 1 {||y i || 2 e(*i,t 2 ]> 

> llyill(llyill-ll^ll)i{|[ yi p 6 (t 1>t2 ]}>o. 
Next, let $k_- = p \wedge (1 + 1/\log p)k'$. If $\hat k = |\mathrm{supp}(\hat\Theta)| > k_-$, then there are at least $(k_- - k')$ rows with row norms greater than the corresponding threshold. Note that the row norms follow (non-)central chi-square distributions with $r$ degrees of freedom, which are stochastically larger than $\chi_r^2$. Thus, we have

P Ck>k-)<(P\ k ' k ) n P(X 2 r>(l + S) 2 U) 

^ ' i=k'+l 

-\k'/iogp) 11 

" 6XP P 2 " Io^ )fc g A 7 + bg^ l0gl ° gP + — l0g(1 + io^ } J 

/ (0/2-1/ logp-e)fc' 

- Ve^y 

Here, the last inequality holds for any fixed e > and sufficiently large p > po( e )> since (logp) fe '/ logp = 
o((k' /ep) tk ) for any e > 0. For any (3 > 2 and k > 2, take e < |, we are led to 

P(*>M<(^) 2 . 

We now define two sets 

S = {i : ||0;|| < <5ytT}, S c = {i: \\0i\\ > S^tZ}. 

For any i G S, 

E(e»,0i - 0»)l{£< fc _). = E ( e *A)l{£< fc _} - E ( e *A)l{fc< fe _}- 
For the first term, we have 

E ( e *' 6, *> 1 {fc<fc_} - E ^'^ 1 {||y<|| 2 >(i+5) 2 t fc _}) 1 {fe<ifc-} 

= E(||ei|| 2 + e^)l| ||yj||2>(1+5)2tfc _|l|^ fc _|. 

When ||yj|| > (1 + 5)y / tk_ and \\6i\\ < 5y / ifc_ , we have ||ej|| > ||yj|| — \\0i\\ > yftiZ. > ll^ill- Thus, 
< Il e i|lll^j|l ^ ll e «l| 2 - Hence, for any % E S, 

[•CO 

E( ei ,^)l {%fc } < 2E||e,|| 2 l {i|eii|2>tfc } = 2 jf P( X 2 > t)dt 

/oo 
P(X 2 > r + \Z2rs + s 2 )(\/27 + 2s)ds 
v/*fe_ 

/oo 
e _s /2 (\/27 + 2s)ds. 
Note that



s - //? _ i ~ v 5 



So, 

E(ei,?i)l 
For the second term, we have 



/8/3 ep 20 ep 
1 + W — log— + — log- 1 



/Slog 



-E ( e *' tf *) 1 {fc<fc_} ~ E ( e *'^) 1 {fe>fc_} 

< E l(e^)|l{£ >fe _} < II^IIEII^II 1 !^.} 

< ||0 { || (E|| ei || 2 P(fc> fc_)) 

<fi n- r y < 6k ' tk - 

ep 

Thus, when (3 > 2, we have 

^E(e,A - Oi)l {X < fc _ } < Cp{rp- p l\k-/efl 2 + k't k _/{ep)) < Ck' (r + log^) . 

For any i G S' c , we have for some t € {ii, . . . , ir} 

(e*, 0; - 0,) = ||ei|| 2 l { || yi || 2>t} - e-0jl { || y .||2< t} 

< ll e il| 2l {||y,|| 2 ><} + ll e i|l(l|ydl + ll e ill) 1 {||y i || 2 <t} 

<C(IN| 2 + i) 

Thus, 

J] E(e i ,? i - < C E II^H 2 +^E^ CA;, ( r + lo S f )• 

ieS c ' ieS c i=l 

Here, the last inequality holds because the size of S c satisfies \S c \(5y / tk_) q < s', and so by the 

definition of k! and fc_, |S C | < 5- q s't k q/2 < 5- q k_. 

To complete the proof, we bound $\mathbb{E}\|\hat\Theta - \Theta\|_F^2\,\mathbf{1}_{\{\hat k > k_-\}}$ as follows. First, let $y_{(i)}$ denote the row in $Y$ with the $i$th largest norm, i.e., $\|y_{(1)}\| \ge \|y_{(2)}\| \ge \cdots$, and write $y_{(i)} = \theta_{(i)} + e_{(i)}$. Then,



ie- ©nil 



E 



i ri {fc>fc-} - L {k>k-} + Ell 6, WII L {k>k-}- 

i<k i>k 



Note that 



E EH e MH 2l {fc>fc-} = E E ll e ^ 2l {fe>fe_} 



i=i 



= p(E|| ei || 4 P(fc>fc_)) 
= pV3^k'/(ep) < Ck'r. 



1/2 
Moreover, $\|\theta_{(i)}\|^2 \le 2(\|y_{(i)}\|^2 + \|e_{(i)}\|^2)$, and so

p 

E Eii wii 2l {^_} < 2E Eiiywii 2l {^_} + 2 E E H e ^ 2l {^-} 

i>k i>k 1=1 

< 2pt k _P(k > k_) + Ck'r 

< Ck'(t k _ + r) < Ck' (r + log |?) . 

This completes the proof. □ 
6.6 Proof of Proposition 3

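Proof. Let $B(\theta, \varepsilon) = \{\theta' \in \Theta : d(\theta, \theta') \le \varepsilon\}$. Let $\alpha \in (0,1)$ and $\varepsilon \in (0, \varepsilon_0]$ be determined later.
First we prove that there exists $\theta^* \in \Theta$ such that

$$ M(B(\theta^*, \varepsilon), \alpha\varepsilon) \ge \Big(\frac{c_0}{\alpha c_1}\Big)^d, \tag{125} $$

which is a simple application of the pigeonhole principle. To see this, let $G_\varepsilon$ denote a minimal $\varepsilon$-cover of $\Theta$, i.e., $|G_\varepsilon| = N(\Theta, \varepsilon)$ and $\Theta \subset \bigcup_{\theta\in G_\varepsilon} B(\theta, \varepsilon)$. Then

$$ N(\Theta, \alpha\varepsilon) = N\Big(\bigcup_{\theta\in G_\varepsilon} B(\theta, \varepsilon), \alpha\varepsilon\Big) \le \sum_{\theta\in G_\varepsilon} N(B(\theta, \varepsilon), \alpha\varepsilon). $$

Consequently, there exists $\theta^* \in G_\varepsilon$ such that

$$ N(B(\theta^*, \varepsilon), \alpha\varepsilon) \ge \frac{N(\Theta, \alpha\varepsilon)}{N(\Theta, \varepsilon)} \ge \Big(\frac{c_0}{\alpha c_1}\Big)^d, $$

where the last inequality follows from (47). Then (125) follows since $M(E, \varepsilon) \ge N(E, \varepsilon)$ for any set $E \subset \Theta$.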

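In view of (125), consider the local packing set $\{\theta_1, \dots, \theta_m\} \subset B(\theta^*, \varepsilon)$, where $m \ge \big(\frac{c_0}{\alpha c_1}\big)^d$ and $\alpha\varepsilon \le d(\theta_i, \theta_j) \le 2\varepsilon$ for any $i \ne j$. By Fano's lemma the probability of error for the multiple hypothesis testing problem $\{P_{\theta_i} : i \in [m]\}$ is lower bounded by

$$ p_e \ge 1 - \frac{\max_{i\ne j} D(P_{\theta_i}\,\|\,P_{\theta_j}) + \log 2}{\log m} \ge 1 - \frac{4A\varepsilon^2 + \log 2}{d\log\frac{c_0}{\alpha c_1}}. $$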

Consequently the minimax estimation error admits the following lower bound 



inf su P E e [d 2 (0(X),0)] > a 2 e 2 ( 1 - ^f"^ 2 ] (126) 



for any e G (0, e ] and a G (0, 1). Pick a = ^l. Set e 2 = if eg > and e 2 = ^ if otherwise. 

Using d > 1, we obtain from (|126p the following 

infsupE,[d 2 (0(X),0)] > 4 (—^y A — 



v v ; ' c 2 V 576A "96, 
which implies the desired (jUJ). □ 
6.7 Proof of Theorem M



Proof. Note that the bound is trivial when $r = k$, so we assume $r \le k - 1$ from now on.
1° Define $S_0 = \frac{1}{n}VDU'UDV' + I_k$. Then,

$$ S_0 - \Sigma = VD\Big(\frac{1}{n}U'U - I_r\Big)DV'. $$

Thus $\|S_0 - \Sigma\| \le \lambda_1\|\frac{1}{n}U'U - I_r\|$. By Proposition 4 and Proposition 5, for $t = \sqrt{(2/n)\log[nh(\lambda)]}$ and an absolute constant $c$, with probability at least $1 - C/[nh(\lambda)]$, $\|\frac{1}{n}U'U - I_r\| \le 2(\sqrt{r/n} + t) + (\sqrt{r/n} + t)^2$. Under the assumption (25), we have



k 



nh{\) 



(127) 



for some sufficiently small $\varepsilon > 0$. This leads to

$$ \|S_0 - \Sigma\| \le C\varepsilon\lambda_1 \le \lambda_r/4. $$

Weyl's theorem [20, Theorem 4.3.1] then implies

$$ \sigma_r(S_0) \ge \sigma_r(\Sigma) - \|S_0 - \Sigma\| \ge (3/4)\lambda_r + 1. \tag{128} $$

Moreover, the definition of $S_0$ implies that $\mathrm{span}(V)$ is the principal subspace of $S_0$, though the individual columns of $V$ are not the leading eigenvectors, and for any $l > r$, $\sigma_l(S_0) = 1$.



2° Note that

$$ S - S_0 = \Big(\frac{1}{n}Z'Z - I_k\Big) + \frac{1}{n}(VDU'Z + Z'UDV'). $$

Thus,

$$ \|S - S_0\| \le \Big\|\frac{1}{n}Z'Z - I_k\Big\| + \frac{2}{n}\sqrt{\lambda_1}\,\|U'Z\|. $$



Again, by Proposition [4] and Proposition El for t = y (2/n) log[n/i(A)] and an absolute constant 
c, with probability at least 1 - C/[nh(X)\, ||iZ'Z-I fc || < 2(y^ + t) + (\J~^ + t) 2 , and ||U'Z|| < 
n^/T^Sti^ + J% + t). Under assumption (fTZTl) . this lead to 

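$$ \|S - S_0\| \le C\varepsilon\lambda_1 \le C\kappa\varepsilon\lambda_r \le \lambda_r/4. $$

When the last inequality holds, Weyl's theorem [20, Theorem 4.3.1] leads to

$$ \sigma_{r+1}(S) \le \sigma_{r+1}(S_0) + \|S - S_0\| \le 1 + \lambda_r/4. \tag{129} $$

Therefore, let $E$ denote the event that both (128) and (129) hold, then

$$ \mathbb{E}\|\hat V\hat V' - VV'\|_F^2 \le \mathbb{E}\|\hat V\hat V' - VV'\|_F^2\,\mathbf{1}_{\{E\}} + \mathbb{E}\|\hat V\hat V' - VV'\|_F^2\,\mathbf{1}_{\{E^c\}}. \tag{130} $$

Since $\|\hat V\hat V' - VV'\|_F^2 \le 2r$, we have

$$ \mathbb{E}\|\hat V\hat V' - VV'\|_F^2\,\mathbf{1}_{\{E^c\}} \le 2rP(E^c) \le \frac{Cr}{nh(\lambda)}. \tag{131} $$

In what follows, we only need to bound $\mathbb{E}\|\hat V\hat V' - VV'\|_F^2\,\mathbf{1}_{\{E\}}$. To this end, let $[V\ V^\perp]$ be an orthogonal matrix. We apply Theorem 10 to obtain that

$$\begin{aligned}
\|\hat V\hat V' - VV'\|_F^2\,\mathbf{1}_{\{E\}}
&\le \frac{2\min\big(\|(S - S_0)V\|_F^2,\ \|(S - S_0)V^\perp\|_F^2\big)}{\big(\sigma_r(S_0) - \sigma_{r+1}(S)\big)^2}\,\mathbf{1}_{\{E\}} \\
&\le \frac{8}{\lambda_r^2}\min\big(\|(S - S_0)V\|_F^2,\ \|(S - S_0)V^\perp\|_F^2\big).
\end{aligned}$$

Hence,

$$ \mathbb{E}\|\hat V\hat V' - VV'\|_F^2\,\mathbf{1}_{\{E\}} \le \frac{8}{\lambda_r^2}\min\big(\mathbb{E}\|(S - S_0)V\|_F^2,\ \mathbb{E}\|(S - S_0)V^\perp\|_F^2\big). \tag{132} $$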



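3° We now control the right-hand side of (132). To this end, we consider two cases: 1) $r \le k/2$ and 2) $r > k/2$.

First consider the case when $r \le k/2$. In this case, we have

$$ (S - S_0)V = VV'\Big(\frac{1}{n}Z'Z - I_k\Big)V + V^\perp(V^\perp)'\frac{1}{n}Z'ZV + \frac{1}{n}VDU'ZV + \frac{1}{n}Z'UD. $$

Note that $\|AB\|_F \le \|A\|\,\|B\|_F$. The triangle inequality thus leads to

$$ \|(S - S_0)V\|_F \le \Big\|V'\Big(\frac{1}{n}Z'Z - I_k\Big)V\Big\|_F + \frac{1}{n}\|(V^\perp)'Z'ZV\|_F + \frac{2\sqrt{\lambda_1}}{n}\|U'Z\|_F. $$

Note that for any matrix $W \in \mathbb{R}^{n\times l}$ with iid $N(0,1)$ entries,

$$ \mathbb{E}\Big\|\frac{1}{n}W'W - I_l\Big\|_F^2 = l\,\frac{\mathbb{E}(\|W_{*1}\|^2 - n)^2}{n^2} + l(l-1)\,\frac{\mathbb{E}\langle W_{*1}, W_{*2}\rangle^2}{n^2} = \frac{l^2 + l}{n}. \tag{133} $$

Thus $\mathbb{E}\|V'(\frac{1}{n}Z'Z - I_k)V\|_F^2 = (r^2 + r)/n$. Moreover, note that for any two independent random matrices $A \in \mathbb{R}^{n\times l_1}$ and $B \in \mathbb{R}^{n\times l_2}$ with iid $N(0,1)$ entries,

$$ \mathbb{E}\|A'B\|_F^2 = l_1l_2\,\mathbb{E}|\langle A_{*1}, B_{*1}\rangle|^2 = l_1l_2n. \tag{134} $$

Since $V'V^\perp = 0$, $ZV$ and $ZV^\perp$ are independent, and so $\mathbb{E}\|(V^\perp)'Z'ZV\|_F^2 = (k - r)rn$ and $\mathbb{E}\|U'Z\|_F^2 = rkn$. Hence,

$$ \mathbb{E}\|(S - S_0)V\|_F^2 \le \frac{C}{n}\big(r^2 + r + (k - r)r + \lambda_1kr\big) \le \frac{C}{n}(\lambda_1 + 1)r(k - r). \tag{135} $$

Here, the last inequality depends on the fact that $r \le k/2$.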
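The two moment identities (133) and (134) lend themselves to a quick Monte Carlo check. The sketch below is illustrative only (function name, sample sizes and seed are arbitrary choices); it estimates $\mathbb{E}\|\frac{1}{n}W'W - I_l\|_F^2$ and $\mathbb{E}\|A'B\|_F^2$ for independent standard Gaussian matrices and can be compared with $(l^2 + l)/n$ and $l_1l_2n$.

```python
import numpy as np


def monte_carlo_moments(n=50, l1=3, l2=4, reps=4000, seed=0):
    """Monte Carlo estimates of E||W'W/n - I||_F^2 and E||W'B||_F^2."""
    rng = np.random.default_rng(seed)
    gram_err = 0.0
    cross = 0.0
    for _ in range(reps):
        W = rng.standard_normal((n, l1))   # n x l1, iid N(0, 1)
        B = rng.standard_normal((n, l2))   # n x l2, independent of W
        gram_err += np.linalg.norm(W.T @ W / n - np.eye(l1), "fro") ** 2
        cross += np.linalg.norm(W.T @ B, "fro") ** 2
    return gram_err / reps, cross / reps


if __name__ == "__main__":
    g, c = monte_carlo_moments()
    print("gram:", g, "theory:", (3 ** 2 + 3) / 50)
    print("cross:", c, "theory:", 3 * 4 * 50)
```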

Next, consider the case when $r > k/2$. In this case, we have

$$ (S - S_0)V^\perp = VV'\frac{1}{n}Z'ZV^\perp + V^\perp(V^\perp)'\Big(\frac{1}{n}Z'Z - I_k\Big)V^\perp + \frac{1}{n}VDU'ZV^\perp. $$

Thus,

$$ \|(S - S_0)V^\perp\|_F \le \Big\|(V^\perp)'\Big(\frac{1}{n}Z'Z - I_k\Big)V^\perp\Big\|_F + \frac{1}{n}\|V'Z'ZV^\perp\|_F + \frac{\sqrt{\lambda_1}}{n}\|U'ZV^\perp\|_F. $$

By (133), $\mathbb{E}\|(V^\perp)'(\frac{1}{n}Z'Z - I_k)V^\perp\|_F^2 = [(k - r)^2 + (k - r)]/n$. Note that $ZV^\perp$ is an $n$-by-$(k - r)$ matrix with iid $N(0,1)$ entries. Thus, (134) leads to $\mathbb{E}\|V'Z'ZV^\perp\|_F^2 = (k - r)rn$ and $\mathbb{E}\|U'ZV^\perp\|_F^2 \le r(k - r)n$. Hence,

$$ \mathbb{E}\|(S - S_0)V^\perp\|_F^2 \le \frac{C}{n}\big((k - r)^2 + (k - r) + (k - r)r + \lambda_1(k - r)r\big) \le \frac{C}{n}(\lambda_1 + 1)r(k - r). \tag{136} $$

Here, the last inequality depends on $r > k/2$.

Combining (130), (131), (132), (135) and (136), we complete the proof. □



6.8 Proof of Lemmas 

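Proof of the lemma giving (56). Since $T'T = V'V = I_r$, we have

$$\begin{aligned}
\langle\Sigma, VV' - TT'\rangle &= \langle VDV', VV' - TT'\rangle \\
&= \mathrm{Tr}(VDV') - \mathrm{Tr}(T'VDV'T) \\
&= \mathrm{Tr}(D) - \mathrm{Tr}(DV'TT'V) \\
&= \sum_{i=1}^{r}\lambda_i\big(1 - (V'TT'V)_{ii}\big) \\
&\ge \lambda_r\big(r - \mathrm{Tr}(V'TT'V)\big) \\
&= \frac{\lambda_r}{2}\|VV' - TT'\|_F^2,
\end{aligned}$$

where the inequality follows because $(V'TT'V)_{ii} = \sum_{j=1}^{r}\langle T_{*j}, V_{*i}\rangle^2 \le \|V_{*i}\|^2 = 1$. The other side of (56) follows analogously. □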
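The two-sided bound can be illustrated numerically. The sketch below is a hypothetical check, assuming $\Sigma = VDV' + I_p$ with $D = \mathrm{diag}(\lambda_1, \dots, \lambda_r)$ (the identity part contributes zero to the inner product because both projections have trace $r$); the upper bound with $\lambda_1$ is the "other side" mentioned above, and the function name is illustrative.

```python
import numpy as np


def curvature_check(p=12, r=3, seed=0):
    """Return (lower, inner, upper) for <Sigma, VV' - TT'> and its two bounds."""
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((p, r)))   # V'V = I_r
    T, _ = np.linalg.qr(rng.standard_normal((p, r)))   # T'T = I_r
    lam = np.sort(rng.uniform(0.5, 4.0, size=r))[::-1]  # lambda_1 >= ... >= lambda_r
    Sigma = V @ np.diag(lam) @ V.T + np.eye(p)          # assumption: spiked form
    diff = V @ V.T - T @ T.T
    inner = np.sum(Sigma * diff)                        # trace inner product
    sq = np.linalg.norm(diff, "fro") ** 2
    return 0.5 * lam[-1] * sq, inner, 0.5 * lam[0] * sq


if __name__ == "__main__":
    print(curvature_check())
```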



Proof of Lemma^4\ Since K is real symmetric, we can diagonalize K as K = Y2j=i^j^j^jj where 
(tj, U) = <%, Ei=i ^ = 0> £j=i ^ = !• Th en 



;z'z,k> = <z'z,^;.> = J2 d i(W zt j 

j=i i=i 



n 



By the orthonormality of {tj}, Ztj ~ iV(0,I n ). Then ||Zi,-|| 2 ~ Xn- Let ll Z *if - £"=i Y ?i> wheie 



2 iid 2 



.2 (d) 



~ JV(0,1). Then 



v 1=1 7=1 v 



Let J+ = {j G [p] : dj > 0} and J_ = {j E [p] : dj < 0}. Applying [3l|, Lemma 4 (4.1), p. 1325] 
(with D = n\ J+|, <nj = ^L, £" =1 EjeJ+ 4 - 1 ' maXi i M - max i - 7i)> we have 



P < 



i=i jeJ+ v 



2t 2 

> 2t + — V < exp(-^), 
'n 
which also holds with $J_+$ replaced with $J_-$ and $d_j$ replaced with $-d_j$. Since
i n a n A 

jr'SJ-f ' 4=1 J 



— (Z'Z,K) 



i=i jej + v i=\ v 

applying the union bound gives the desired (57).

Proof Lemma O First we assume that c = 1 . Then 

EmaxlXJ 2 = 2 / P ( max IXJ > t \tdt 
ie[N] J Q [ie[N] 



□ 



= 2 [ P (max LXjl > as + 6s 2 \(as + 6s 2 )(a + 2os)ds 

JO l^W J 

/>oo 

<4/ (JVP{|Xl| > as + bs 2 } A l)(a 2 s + 4&V)ds 

JO 

/>oo 

< 4 / (iVexp(-s 2 ) A l)(a 2 s + 4&V)ds 
Jo 

/Vlog AT /■oo 

= 4/ (a 2 s + 46 2 s 3 )ds + 4iV / exp(-s 2 )(a 2 s + 46 2 s 3 )ds 

JO JJTo^N 



(137) 



(2a 2 + 86 2 ) log (e AO + 26 2 log 2 N 



(138) 



where (137) follows from the union bound and the elementary inequality $(as + bs^2)(a + 2bs) \le s(a + 2bs)^2 \le 2a^2s + 8b^2s^3$, while (138) follows from the fact that $\int_t^\infty x^3\exp(-x^2)\,dx = \frac{t^2 + 1}{2}\exp(-t^2)$.
If $c \ne 1$, we simply replace $N$ by $cN$. □
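Both elementary facts used in this step can be confirmed with a few lines of standard-library Python. This is only a sanity check with illustrative function names: a midpoint rule approximates the tail integral (truncated at a large upper limit, where the integrand is negligible), and the polynomial inequality is evaluated as a difference.

```python
import math


def tail_integral_numeric(t, upper=12.0, steps=200_000):
    """Midpoint-rule approximation of the integral of x^3 exp(-x^2) over [t, upper]."""
    h = (upper - t) / steps
    total = 0.0
    for i in range(steps):
        x = t + (i + 0.5) * h
        total += x ** 3 * math.exp(-x * x)
    return total * h


def tail_integral_closed(t):
    """Closed form (t^2 + 1) / 2 * exp(-t^2)."""
    return 0.5 * (t * t + 1.0) * math.exp(-t * t)


def elementary_gap(a, b, s):
    """2 a^2 s + 8 b^2 s^3 - (a s + b s^2)(a + 2 b s); nonnegative for a, b, s >= 0."""
    return 2 * a * a * s + 8 * b * b * s ** 3 - (a * s + b * s * s) * (a + 2 * b * s)


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.5, 3.0):
        print(t, tail_integral_numeric(t), tail_integral_closed(t))
```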

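Proof of the lemma bounding $|\langle\Sigma, F\rangle|$. Since $F$ is symmetric, we can diagonalize it as $F = A\Lambda A'$, where $A$ is an orthogonal matrix and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ such that $|\lambda_i| \le \|F\|$. Write $\Lambda = \Lambda^+ - \Lambda^-$, where $\lambda_i^+ = \lambda_i \vee 0$ and $\lambda_i^- = -(\lambda_i \wedge 0)$. Since $\Sigma$ is symmetric positive definite, we have $0 \le \mathrm{Tr}(\Sigma A\Lambda^+A') = \mathrm{Tr}(A'\Sigma A\Lambda^+) \le \|F\|\,\mathrm{Tr}(A'\Sigma A) = \|F\|\,\mathrm{Tr}(\Sigma)$. Similarly, $0 \le \mathrm{Tr}(\Sigma A\Lambda^-A') \le \|F\|\,\mathrm{Tr}(\Sigma)$. Therefore $|\langle\Sigma, F\rangle| = |\mathrm{Tr}(\Sigma A\Lambda^+A') - \mathrm{Tr}(\Sigma A\Lambda^-A')| \le \|F\|\,\mathrm{Tr}(\Sigma)$. □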
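A short numerical illustration of the resulting inequality $|\langle\Sigma, F\rangle| \le \|F\|\,\mathrm{Tr}(\Sigma)$, with $\Sigma$ a random positive definite matrix and $F$ a random symmetric (generally indefinite) matrix. The construction of both matrices and the function name are illustrative assumptions.

```python
import numpy as np


def trace_inner_product_bound(p=8, seed=0):
    """Return (|<Sigma, F>|, ||F|| * Tr(Sigma)) for random Sigma > 0 and symmetric F."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((p, p))
    Sigma = B @ B.T + 0.1 * np.eye(p)          # symmetric positive definite
    F = rng.standard_normal((p, p))
    F = (F + F.T) / 2.0                        # symmetric
    lhs = abs(np.sum(Sigma * F))               # |trace inner product|
    rhs = np.linalg.norm(F, 2) * np.trace(Sigma)   # spectral norm times trace
    return lhs, rhs


if __name__ == "__main__":
    print(trace_inner_product_bound())
```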

6.9 A Sin-Theta Theorem 

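Theorem 10 ($\sin\theta$ theorem for symmetric matrices [3]). Let $A$ and $A + E$ be symmetric matrices satisfying

$$ A = [F_0\ F_1]\begin{bmatrix} A_0 & 0 \\ 0 & A_1 \end{bmatrix}\begin{bmatrix} F_0' \\ F_1' \end{bmatrix}, \qquad A + E = [G_0\ G_1]\begin{bmatrix} \Lambda_0 & 0 \\ 0 & \Lambda_1 \end{bmatrix}\begin{bmatrix} G_0' \\ G_1' \end{bmatrix}, $$

where $[F_0\ F_1]$ and $[G_0\ G_1]$ are orthogonal matrices. If the eigenvalues of $A_0$ are contained in an interval $(a, b)$, and the eigenvalues of $\Lambda_1$ are excluded from the interval $(a - \delta, b + \delta)$ for some $\delta > 0$, then

$$ \frac{1}{\sqrt{2}}\|F_0F_0' - G_0G_0'\|_F \le \frac{\min\big(\|F_1'EG_0\|_F,\ \|F_0'EG_1\|_F\big)}{\delta} $$

and

$$ \|F_0F_0' - G_0G_0'\| \le \frac{\min\big(\|F_1'EG_0\|,\ \|F_0'EG_1\|\big)}{\delta}. $$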
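A sin-theta bound of this type can be illustrated on a synthetic example. The sketch below is a hypothetical setup, not taken from the paper: a symmetric matrix with a well-separated leading block is perturbed by a small symmetric $E$, and the Frobenius-norm distance between the leading projections is compared with $\|F_0'EG_1\|_F/\delta$ only (the other term inside the minimum is not checked here). The gap $\delta$ is computed as the distance between the leading eigenvalues of $A$ and the trailing eigenvalues of $A + E$.

```python
import numpy as np


def sin_theta_check(p=10, r=3, noise=0.05, seed=0):
    """Return (lhs, rhs, delta) for the Frobenius-norm sin-theta bound."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
    F0, F1 = Q[:, :r], Q[:, r:]
    top = rng.uniform(3.0, 4.0, size=r)          # eigenvalues of the leading block of A
    rest = rng.uniform(0.0, 1.0, size=p - r)     # remaining eigenvalues of A
    A = F0 @ np.diag(top) @ F0.T + F1 @ np.diag(rest) @ F1.T
    E = rng.standard_normal((p, p)) * noise
    E = (E + E.T) / 2.0                          # symmetric perturbation
    vals, vecs = np.linalg.eigh(A + E)           # eigenvalues in ascending order
    G0, G1 = vecs[:, -r:], vecs[:, :-r]          # leading / trailing eigenvectors of A + E
    delta = top.min() - vals[:-r].max()          # separation between the two spectra
    lhs = np.linalg.norm(F0 @ F0.T - G0 @ G0.T, "fro") / np.sqrt(2.0)
    rhs = np.linalg.norm(F0.T @ E @ G1, "fro") / delta
    return lhs, rhs, delta


if __name__ == "__main__":
    print(sin_theta_check())
```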
6.10 Ancillary Results

In this part, we collect a few useful tail bounds. 

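Proposition 4. Let $Y$ be an $n \times k$ matrix with i.i.d. $N(0,1)$ entries. For any $t > 0$,

$$ P\Big\{\Big\|\frac{1}{n}Y'Y - I_k\Big\| > 2\big(\sqrt{k/n} + t\big) + \big(\sqrt{k/n} + t\big)^2\Big\} \le 2e^{-nt^2/2}. $$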

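Proof. Let $W_{nk} = \frac{1}{n}Y'Y$, and $\delta_k(t) = 2(\sqrt{k/n} + t) + (\sqrt{k/n} + t)^2$. Note that $\sigma_i(W_{nk}) = n^{-1}\sigma_i^2(Y)$ for $i = 1, \dots, \min(k, n)$. Therefore, we have

$$\begin{aligned}
\{\sigma_1(W_{nk}) > 1 + \delta_k(t)\} &\subset \{\sigma_1(Y) > \sqrt{n} + \sqrt{k} + \sqrt{n}\,t\}, \\
\{\sigma_k(W_{nk}) < 1 - \delta_k(t)\} &\subset \{\sigma_k(Y) < \sqrt{n} - \sqrt{k} - \sqrt{n}\,t\}.
\end{aligned}$$

Since $\|W_{nk} - I_k\| = \max\{\sigma_1(W_{nk}) - 1,\ 1 - \sigma_k(W_{nk})\}$, we obtain that

$$\begin{aligned}
P\{\|W_{nk} - I_k\| > \delta_k(t)\} &\le P\{\sigma_1(W_{nk}) > 1 + \delta_k(t)\} + P\{\sigma_k(W_{nk}) < 1 - \delta_k(t)\} \\
&\le P\{\sigma_1(Y) > \sqrt{n} + \sqrt{k} + \sqrt{n}\,t\} + P\{\sigma_k(Y) < \sqrt{n} - \sqrt{k} - \sqrt{n}\,t\} \\
&\le 2e^{-nt^2/2}.
\end{aligned}$$

Here, the second inequality follows from the inclusion established in the second last display and the last inequality is due to Lemma 9. This completes the proof. □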
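The tail bound just proved can be compared with simulation. The following sketch (illustrative function names and parameter values) draws Gaussian matrices, computes the spectral-norm deviation of the sample Gram matrix from the identity, and reports how often it exceeds $\delta_k(t)$ next to the bound $2e^{-nt^2/2}$.

```python
import numpy as np


def delta(k, n, t):
    """delta_k(t) = 2 (sqrt(k/n) + t) + (sqrt(k/n) + t)^2."""
    x = np.sqrt(k / n) + t
    return 2.0 * x + x ** 2


def exceedance_rate(n=200, k=10, t=0.1, reps=500, seed=0):
    """Empirical frequency of ||Y'Y/n - I_k|| > delta_k(t), and the theoretical bound."""
    rng = np.random.default_rng(seed)
    thr = delta(k, n, t)
    hits = 0
    for _ in range(reps):
        Y = rng.standard_normal((n, k))
        dev = np.linalg.norm(Y.T @ Y / n - np.eye(k), 2)   # spectral norm
        if dev > thr:
            hits += 1
    return hits / reps, 2.0 * np.exp(-n * t ** 2 / 2.0)


if __name__ == "__main__":
    print(exceedance_rate())
```

In this regime the bound is far from tight, so the empirical frequency is expected to sit well below it.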

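Proposition 5. Let $Y \in \mathbb{R}^{n\times l}$ and $Z \in \mathbb{R}^{n\times m}$ be two independent matrices with i.i.d. $N(0,1)$ entries. Then for any $0 < a < \frac{1}{2}\sqrt{n}$ and $b > 0$,

$$ P\Big\{\frac{1}{n}\|Y'Z\| > \sqrt{1 + \frac{a}{\sqrt{n}}}\Big(\sqrt{\frac{l}{n}} + \sqrt{\frac{m}{n}} + \frac{b}{\sqrt{n}}\Big)\Big\} \le l\,e^{-3a^2/16} + e^{-b^2/2}. $$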



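Proof. Without loss of generality, suppose $l \le m$. Define $\tilde Y = [\tilde y_1, \dots, \tilde y_l]$, where $\tilde y_i = y_i/\|y_i\|$ with $y_i$ the $i$th column of $Y$. Then, we obtain $\|Y'Z\| \le (\max_{1\le i\le l}\|y_i\|)\,\|\tilde Y'Z\|$. Note that the $\|y_i\|^2$ are i.i.d. $\chi_n^2$ random variables. We apply Lemma 8 to obtain that

$$ P\big\{\|y_i\| > \sqrt{n + a\sqrt{n}} \text{ for some } i = 1, \dots, l\big\} \le \sum_{i=1}^{l} P\big\{\|y_i\|^2 > n + a\sqrt{n}\big\} \le l\,e^{-3a^2/16}. $$

On the other hand, $\tilde Y'Z$ is an $l \times m$ matrix with i.i.d. $N(0,1)$ entries. To see this, note that $\tilde y_i'Z$ has iid $N(0,1)$ elements since $\tilde y_i$ has unit norm and $Z$ has iid $N(0,1)$ entries. In addition, the $\tilde y_i'Z$'s are mutually independent since the $y_i$'s are independent. So, Lemma 9 leads to $P\{\|\tilde Y'Z\| > \sqrt{l} + \sqrt{m} + b\} \le e^{-b^2/2}$. Therefore, we obtain

$$\begin{aligned}
P\Big\{\frac{1}{n}\|Y'Z\| > \sqrt{1 + \frac{a}{\sqrt{n}}}\Big(\sqrt{\frac{l}{n}} + \sqrt{\frac{m}{n}} + \frac{b}{\sqrt{n}}\Big)\Big\}
&\le P\big\{\|y_i\| > \sqrt{n + a\sqrt{n}} \text{ for some } i = 1, \dots, l\big\} + P\big\{\|\tilde Y'Z\| > \sqrt{l} + \sqrt{m} + b\big\} \\
&\le l\,e^{-3a^2/16} + e^{-b^2/2},
\end{aligned}$$

completing the proof. □

Lemma 8 ([23]). Let $\chi_n^2$ denote a Chi-square random variable with $n$ degrees of freedom. Then

$$\begin{aligned}
P\{\chi_n^2 \le n(1 - \varepsilon)\} &\le e^{-n\varepsilon^2/4}, && \text{when } 0 < \varepsilon < 1, \\
P\{\chi_n^2 \ge n(1 + \varepsilon)\} &\le e^{-3n\varepsilon^2/16}, && \text{when } 0 < \varepsilon < \tfrac{1}{2}, \\
P\{\chi_n^2 \ge n(1 + \varepsilon)\} &\le \frac{\sqrt{2}}{\varepsilon\sqrt{n}}\,e^{-n\varepsilon^2/4}, && \text{when } 0 < \varepsilon < n^{1/16},\ n \ge 16.
\end{aligned}$$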
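The three chi-square tail bounds can be compared against empirical frequencies. The sketch below is a simulation with illustrative parameter choices ($n = 64$, $\varepsilon = 0.3$, which satisfies the range conditions of all three bounds); the function name and return structure are assumptions of this example.

```python
import numpy as np


def chi2_tail_check(n=64, eps=0.3, reps=200_000, seed=0):
    """Empirical lower/upper tail frequencies of chi^2_n next to the three bounds."""
    rng = np.random.default_rng(seed)
    x = rng.chisquare(n, size=reps)
    return {
        "lower_emp": float(np.mean(x <= n * (1 - eps))),
        "upper_emp": float(np.mean(x >= n * (1 + eps))),
        "lower_bound": float(np.exp(-n * eps ** 2 / 4)),
        "upper_bound": float(np.exp(-3 * n * eps ** 2 / 16)),
        "upper_bound_sharp": float(np.sqrt(2) / (eps * np.sqrt(n)) * np.exp(-n * eps ** 2 / 4)),
    }


if __name__ == "__main__":
    for name, value in chi2_tail_check().items():
        print(name, value)
```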

Lemma 9 ([17, Theorem II.7]). Let $Y$ be $n \times p$ with iid $N(0,1)$ entries. If $n \ge p$, then for any $t > 0$,

$$\begin{aligned}
P\{\sigma_1(Y) > \sqrt{n} + \sqrt{p} + t\} &\le e^{-t^2/2}, \\
P\{\sigma_p(Y) < \sqrt{n} - \sqrt{p} - t\} &\le e^{-t^2/2}.
\end{aligned}$$